Abstract

We present a calculus for processing semistructured data that spans 
differences of application area among several novel query lan- 
guages, broadly categorized as "NoSQL". This calculus lets users 
define their own operators, capturing a wider range of data process- 
ing capabilities, whilst providing a typing precision so far typical 
only of primitive hard-coded operators. The type inference algo- 
rithm is based on semantic type checking, resulting in type infor- 
mation that is both precise, and flexible enough to handle structured 
and semistructured data. We illustrate the use of this calculus by 
encoding a large fragment of Jaql, including operations and itera- 
tors over JSON, embedded SQL expressions, and co-grouping, and 
show how the encoding directly yields a typing discipline for Jaql 
as it is, namely without the addition of any type definition or type 
annotation in the code. 



1. Introduction 

The emergence of Cloud computing and the ever-growing importance
of data in applications have given birth to a whirlwind of new
data models lfl9l l24ll and languages. Whether they are developed 
under the banner of "NoSQL" J lalsll . for BigData Analytics lH, 
[Mill, for Cloud computing (3|], or as domain specific languages 
(DSL) embedded in a host language fTH [27l [331 . most of them 
share a common subset of SQL and the ability to handle semistruc- 
tured data. While there is no consensus yet on the precise bound- 
aries of this class of languages, they all share two common traits: 
(i) an emphasis on sequence operations (e.g., through the popular
MapReduce paradigm) and (ii) a lack of types for both data and programs
(contrary to, say, XML programming or relational databases,
where data schemas are pervasive). In [21, 22], Meijer argues that
such languages can greatly benefit from formal foundations, and
suggests comprehensions [7, 33, 34] as a unifying model. Although
we agree with Meijer on the need to provide unified, formal foundations
to those new languages, we argue that such foundations
should account for novel features critical to various application do- 
mains that are not captured by comprehensions. Also, most of those 
languages provide limited type checking, or ignore it altogether. We 
believe type checking is essential for many applications, with usage 
ranging from error detection to optimization. But we understand the 
designers and programmers of those languages who are averse to 
any kind of type definition or annotation. In this paper, we propose 
a calculus which is expressive enough to capture languages that go 
beyond SQL or comprehensions. We show how the calculus adapts 
to various data models while retaining a precise type checking that 
can exploit in a flexible way limited type information, information 
that is deduced directly from the structure of the program even in
the absence of any explicit type declaration or annotation.

Example. We use Jaql ||1,[T1], a language over JSON (T^] devel- 
oped for BigData analytics, to illustrate how our proposed calculus 
works. Our reason for using Jaql is that it encompasses all the fea- 
tures found in the previously cited query languages and includes a 
number of original ones, as well. Like Pig [28], it supports sequence
iteration, filtering, and grouping operations on non-nested queries. 
Like AQL iQ] and XQuery j^, it features nested queries. Further- 
more, Jaql uses a rich data model that allows arbitrary nesting of 
data (it works on generic sequences of JSON records whose fields 
can contain other sequences or records) while other languages are 
limited to flat data models, such as AQL whose data-model is sim- 
ilar to the standard relational model used by SQL databases (tuples 
of scalars and of lists of scalars). Lastly, Jaql includes SQL as an 
embedded sub-language for relational data. For these reasons, al- 
though in the present work we focus almost exclusively on Jaql, we 
believe that our work can be adapted without effort to a wide array 
of sequence processing languages. 

The following Jaql program illustrates some of those features.
It performs co-grouping Il2lll between one JSON input, containing 
information about departments, and one relational input contain- 
ing information about employees. The query returns for each de- 
partment its name and id, from the first input, and the number of 
high-income employees from the second input. A SQL expression 
is used to select the employees with income above a given value, 
while a Jaql filter is used to access the set of departments and the 
elements of these two collections are processed by the group ex- 
pression (in Jaql "$" denotes the current element). 

group
  (depts -> filter each x (x.size > 50))
    by g = $.depid as ds,
  (SELECT * FROM employees WHERE income > 100)
    by g = $.dept as es
  into { dept: g,
         deptName: ds[0].name,
         numEmps: count(es) };

The query blends Jaql expressions (eg, filter which selects, in 
the collection depts, departments with a size of more than 50 
employees, and the grouping itself) with a SQL statement (select- 
ing employees in a relational table for which the salary is more 
than 100). Relations are naturally rendered in JSON as collections 
of records. In our example, one of the key differences is that field
access in SQL requires the field to be present in the record, while 
the same operation in Jaql does not. Actually, field selection in Jaql 
is very expressive since it can be applied also to collections with the 
effect that the selection is recursively applied to the components of 
the collection and the collection of the results returned, and simi- 
larly for filter and other iterators. In other words, the expression 
filter each x (x.size > 50) above will work as much when
x is bound to a record (with or without a size field: in the latter
case the selection returns null), as when x is bound to a collection 
of records or of arbitrary nested collections thereof. This accounts 
for the semistructured nature of JSON compared to the relational 
model. Our calculus can express both, in a way that illustrates the 
difference in both the dynamic semantics and static typing. 

In our calculus, the selection of all records whose mandatory 
field income is greater than 100 is defined as: 
let Sel =
    `nil => `nil
  | ({income: x, ..} as y, tail) =>
      if x > 100 then (y, Sel(tail)) else Sel(tail)

(collections are encoded as lists à la Lisp) while the filtering among
records or arbitrary nested collections of records of those where the
(optional) size field is present and larger than 50 is:

let Fil =
    `nil => `nil
  | ({size: x, ..} as y, tail) =>
      if x > 50 then (y, Fil(tail)) else Fil(tail)
  | ((x,xs), tail) => (Fil(x,xs), Fil(tail))
  | (_, tail) => Fil(tail)

The terms above show nearly all the basic building blocks of our
calculus (only composition is missing), building blocks that we dub
filters. Filters can be defined recursively (e.g., Sel(tail) is a recursive
call); they can perform pattern matching as found in functional
languages (the filter p => f executes f in the environment resulting
from the matching of pattern p); they can be composed in alternation
(f1|f2 tries to apply f1 and, if it fails, applies f2); and they can
spread over the structure of their argument (e.g., (f1,f2), of which
(y, Sel(tail)) is an instance, requires an argument of a product
type and applies the corresponding fi component-wise).

For instance, the filter Fil scans collections encoded as lists à
la Lisp (i.e., by right-associative pairs with `nil denoting the empty
list). If its argument is the empty list, then it returns the empty list;
if it is a list whose head is a record with a size field (and possibly
other fields matched by ".."), then it captures the whole record in
y, the content of the field in x, the tail of the list in tail, and keeps
or discards y (i.e., the record) according to whether x (i.e., the field)
is larger than 50; if the head is also a list, then it recursively applies
both on the head and on the tail; if the head of the list is neither a
list, nor a record with a size field, then the head is discarded. The
encoding of the whole grouping query is given in Section 5.3.
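
To make the behaviour of Sel and Fil concrete, the following Python sketch mimics them on nested pairs. It is only an illustration under stated assumptions, not the calculus itself: records are modelled as Python dicts, pairs as 2-tuples, the atom `nil as the string "nil", and the helper names (NIL, to_pairs, sel, fil) are hypothetical.

```python
NIL = "nil"  # stand-in for the atom `nil (an assumption of this sketch)


def to_pairs(items):
    """Encode a Python list as Lisp-style nested pairs ending in NIL."""
    result = NIL
    for item in reversed(items):
        result = (item, result)
    return result


def sel(lst):
    """Mimic Sel: keep records whose mandatory income field exceeds 100."""
    if lst == NIL:
        return NIL
    head, tail = lst
    # The pattern {income: x, ..} requires the field to be present; here a
    # record without it surfaces as a KeyError instead of a matching failure.
    if head["income"] > 100:
        return (head, sel(tail))
    return sel(tail)


def fil(lst):
    """Mimic Fil: keep records with size > 50, recursing into nested lists."""
    if lst == NIL:
        return NIL
    head, tail = lst
    if isinstance(head, dict) and "size" in head:
        return (head, fil(tail)) if head["size"] > 50 else fil(tail)
    if isinstance(head, tuple):  # the head is itself a non-empty list
        return (fil(head), fil(tail))
    return fil(tail)  # neither a list nor a record with a size field
```

For example, applying fil to the encoding of [{size: 60}, {name: "b"}, [{size: 10}, {size: 99}], 7] keeps the first record, drops the second and the trailing 7, and filters the nested list down to [{size: 99}].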

Our aim is not to propose yet another "NoSQL/cloud comput- 
ing/bigdata analytics" query language, but rather to show how to 
express and type such languages via an encoding into our core cal- 
culus. Each such language can in this way preserve its execution 
model but obtain for free a formal semantics, a type inference sys- 
tem and, as it happens, a prototype implementation. The type infor- 
mation is deduced via the encoding (without the need of any type 
annotation) and can be used for early error detection and debugging 
purposes. The encoding also yields an executable system that can 
be used for rapid prototyping. Both possibilities are critical in most 
typical usage scenarios of these languages, where deployment is 
very expensive both in time and in resources. As observed by Meijer
[21], the advent of big data makes it more important than ever
for programmers (and, we add, for language and system designers) 
to have a single abstraction that allows them to process, transform, 
query, analyze, and compute across data presenting utter variability 
both in volume and in structure, yielding a "mind-blowing number 
of new data models, query languages, and execution fabrics" [21].
The framework we present here, we claim, encompasses them all. 
A long-term goal is that the compilers of these languages could use 
the type information inferred from the encoding and the encoding 
itself to devise further optimizations. 

Types. Pig, Jaql, and AQL have all been conceived
by considering just the map-reduce execution model. The type (or, 
schema) of the manipulated data did not play any role in their de- 
sign. As a consequence these languages are untyped and, when 
present, types are optional and clearly added as an afterthought. 
Differences in data model or type discipline are particularly im- 
portant when embedded in a host language (since they yield the 
so-called impedance mismatch). The reason why types were/are 
disregarded in such languages may originate in an alleged tension 
between type inference and heterogeneous/semistructured data: on 
the one hand these languages are conceived to work with collec- 
tions of data that are weakly or partially structured, on the other 
hand current languages with type inference (such as Haskell or
ML) can work only on homogeneous collections (typically, lists of 
elements of the same type). 

In this work we show that the two visions can coexist: we type 
data by semantic subtyping fT^ . a type system conceived for semi- 
structured data, and describe computations by our filters which are 
untyped combinators that, thanks to a technique of weak typing introduced
in [13], can polymorphically type the results of data query
and processing with a high degree of precision. The conception of
filters is driven by the schema of the data rather than the execution 
model and we use them (i) to capture and give a uniform semantics 
to a wide range of semistructured data processing capabilities, (ii)
to give a type system that encompasses the types defined for such
languages, if any, notably Pig, Jaql and AQL (but also XML query
and processing languages: see Section 5.3), (iii) to infer the precise
result types of queries written in these languages as they are
(so without the addition of any explicit type annotation/definition or
new construct), and (iv) to show how minimal extensions/modifications
of the current syntax of these languages can bring dramatic
improvements in the precision of the inferred types.

The types we propose here are extensible record types and het- 
erogeneous lists whose content is described by regular expressions 
on types as defined by the following grammar: 



Types  t ::= v                        (singleton)
          |  {ℓ:t, ..., ℓ:t}          (closed record)
          |  {ℓ:t, ..., ℓ:t, ..}      (open record)
          |  [r]                      (sequences)
          |  int | char               (base)
          |  any | empty | null       (special)
          |  t|t                      (union)
          |  t\t                      (difference)

Regexp r ::= ε | t | r* | r+ | r? | r r | r|r

where ε denotes the empty word. The semantics of types can be
expressed in terms of sets of values (values are either constants,
such as 1, 2, true, false, null, and '1', the latter denoting the
character 1; records of values; or lists of values). So the singleton
type v is the type that contains just the value v (in particular
null is the singleton type containing the value null). The closed 
record type {a:int, b:int} contains all record values with exactly 
two fields a and b with integer values, while the open record 
type {a:int, b:int , ..} contains all record values with at least two 
fields a and b with integer values. The sequence type [r] is the set 
of all sequences whose content is described by the regular expres- 
sion r; so, for example [char*] contains all sequences of charac- 
ters (we will use string to denote this type and the standard double 
quote notation to denote its values) while [({a:int} {a:int})+]
denotes nonempty lists of even length containing record values of 
type {a:int}. The union type s|t contains all the values of s and
of t, while the difference type s\t contains all the values of s that
are not in t. We shall use bool as an abbreviation of the union of
the two singleton types containing true and false: `true|`false.
The types any and empty respectively contain all and no values. Recursive
type definitions are also used (see Section 2.2 for formal details).

These types can express all the types of Pig, Jaql and AQL,
all XML types, and much more. So for instance, AQL includes 
only homogeneous lists of type t, that can be expressed by our 
types as [ t* ]. In Jaql's documentation one can find the type
[ long(value=1), string(value="a"), boolean* ] which
is the type of arrays whose first element is 1, the second is the string
"a" and all the others are booleans. This can be easily expressed in
our types as [1 "a" bool*]. But while Jaql only allows a limited
use of regular expressions (Kleene star can only appear in tail
position) our types do not have such restrictions. So for example
[char* '@' char* '.' (('f' 'r')|('i' 't'))] is the
type of all strings (i.e., sequences of chars) that denote email addresses
ending by either .fr or .it. We use some syntactic sugar
to make terms such as the previous one more readable (e.g.,
[ .* '@' .* ('.fr'|'.it') ]). Likewise, henceforth we use {a?: t} to denote
that the field a of type t is optional; this is just syntactic sugar
for stating that either the field is undefined or it contains a value of
type t (for the formal definition see Appendix G).
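
When restricted to strings, a sequence type such as [ .* '@' .* ('.fr'|'.it') ] reads like an ordinary regular expression over characters. The Python sketch below illustrates this analogy with the standard re module; it is only an analogy, since the regular expressions of the type language range over arbitrary types rather than characters, and the names EMAIL_FR_IT and has_email_type are hypothetical.

```python
import re

# Rough string-only analogue of the type [ .* '@' .* ('.fr'|'.it') ].
# DOTALL makes "." accept any character, as the wildcard in the type does.
EMAIL_FR_IT = re.compile(r".*@.*\.(fr|it)", re.DOTALL)


def has_email_type(s: str) -> bool:
    """Check membership: the whole string must be described by the pattern."""
    return EMAIL_FR_IT.fullmatch(s) is not None
```

Membership requires the whole sequence to match (hence fullmatch), mirroring the fact that a sequence type describes the entire content of the list and not just a prefix.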

Coming back to our initial example, the filter Fil defined before 
expects as argument a collection of the following type: 

type Depts = [ ( {size?: int, ..} | Depts )* ]

that is, a possibly empty, arbitrarily nested list of records with an
optional size field of type int: notice that it is important to specify 
the optional field and its type since a size field of a different type 
would make the expression x > 50 raise a run-time error. This 
information is deduced just from the structure of the filter (since 
Fil does not contain any type definition or annotation). 

We define a type inference system that rejects any argument of
Fil that does not have type Depts, and deduces for arguments of type
[({size: int, addr: string} | {sec: int} | Depts)+]
(which is a subtype of Depts) the result type [({size: int,
addr: string} | Depts)*] (so it does not forget the field addr
but discards the field sec, and by replacing + with * recognizes that
the test may fail).

By encoding primitive Jaql operations into a formal core calculus
we shall provide them with a formal and clean semantics as
well as precise typing. So for instance it will be clear that applying
the dot selection [ [{a:3}] {a:5, b:true} ].a
yields [ [3] 5 ], and we shall be able to deduce
that _.a applied to arbitrarily nested lists of records with an optional
integer a field (i.e., of type t = {a?: int} | [ t* ])
yields arbitrarily nested lists of int or null values (i.e., of type
u = int | null | [ u* ]).
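
The recursive behaviour of dot selection can be sketched in a few lines of Python. This is an informal illustration and not Jaql's implementation: lists stand for sequences, dicts for records, None for null, and the function name select_field is hypothetical. Inputs outside the type t above are rejected, which is an assumption of the sketch.

```python
def select_field(value, label="a"):
    """Mimic the dot selection _.a on arbitrarily nested lists of records."""
    if isinstance(value, list):
        # On a collection, the selection is applied to each component.
        return [select_field(v, label) for v in value]
    if isinstance(value, dict):
        # On a record, a missing field yields null (None here).
        return value.get(label)
    raise TypeError("expected a record or a list of records")
```

With this sketch, select_field([[{"a": 3}], {"a": 5, "b": True}]) evaluates to [[3], 5], matching the result given in the text.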

Finally we shall show that if we accept to extend the current
syntax of Jaql (or of some other language) by some minimal filter 
syntax (eg, the pattern filter) we can obtain a huge improvement in 
the precision of type inference. 

Contributions. The main contribution of this work is the defini- 
tion of a calculus that encompasses structural operators scattered 
over NoSQL languages and that possesses some characteristics 
that make it unique in the swarm of current semi-structured data
processing languages. In particular it is parametric (though fully 
embeddable) in a host language; it uniformly handles both width 
and deep nested data recursion (while most languages offer just the 
former and limited forms of the latter); finally, it includes first-class
arbitrary deep composition (while most languages offer this opera- 
tor only at top level), whose power is nevertheless restrained by the 
type system. 

An important contribution of this work is that it directly com- 
pares a programming language approach with the tree transducer 
one. Our calculus implements transformations typical of top-down 
tree transducers but has several advantages over the transducer approach:
(1) the transformations are expressed in a formalism immediately
intelligible to any functional programmer; (2) our calculus,
in its untyped version, is Turing complete; (3) its transformations
can be statically typed (at the expense of Turing completeness)
without any annotation, yielding precise result types; (4)
even if we restrict the calculus only to well-typed terms (thus losing 
Turing completeness), it still is strictly more expressive than well- 
known and widely studied deterministic top-down tree transducer 
formalisms. 

The technical contributions are (i) the proof of Turing completeness
for our formalism, (ii) the definition of a type system
that copes with records with computable labels, (iii) the definition
of a static type system for filters and its correctness, (iv) the definition
of a static analysis that ensures the termination (and the proof
thereof) of the type inference algorithm with complexity bounds
expressed in the size of types and filters, and (v) the proof that
the terms that pass the static analysis form a language strictly more
expressive than top-down tree transducers.

Outline. In Section 2 we present the syntax of the three components
of our system, namely, a minimal set of expressions, the
calculus of filters used to program user-defined operators or to encode
the operators of other languages, and the core types in which
the types we just presented are to be encoded. Section 3 defines
the operational semantics of filters and a declarative semantics for
operators. The type system as well as the type inference algorithm
are described in Section 4. In Section 5 we present how to handle
a large subset of Jaql. Section 8 reports on some subtler design
choices of our system. We compare related works in Section 9
and conclude in Section 10. In order to avoid blurring the presentation,
proofs, secondary results, further encodings, and extensions
are moved into a separate appendix.

2. Syntax 

In this section we present the syntax of the three components of our 
system: a minimal set of expressions, the calculus of filters used to 
program user-defined operators or to encode the operators of other 
languages, and the core types in which the types presented in the 
introduction are to be encoded. 

The core of our work is the definition of filters and types. The 
key property of our development is that filters can be grafted to 
any host language that satisfies minimal requirements, by simply 
adding filter application to the expressions of the host language. 
The minimal requirements of the host language for this to be possi- 
ble are quite simple: it must have constants (typically for types int, 
char, string, and bool), variables, and either pairs or record val- 
ues (not necessarily both). On the static side the host language must 
have at least basic and products types and be able to assign a type to 
expressions in a given type environment (ie, under some typing as- 
sumptions for variables). By the addition of filter applications, the 
host language can acquire or increase the capability to define poly- 
morphic user-defined iterators, query and processing expressions, 
and be enriched with a powerful and precise type system. 

2.1 Expressions 

In this work we consider the following set of expressions 
Definition 1 (expressions).

Exprs e ::= c                    (constants)
         |  x                    (variables)
         |  (e,e)                (pairs)
         |  {e:e, ..., e:e}      (records)
         |  e + e                (record concatenation)
         |  e \ ℓ                (field deletion)
         |  op(e, ..., e)        (built-in operators)
         |  f e                  (filter application)

where f ranges over filters (defined later on), c over generic constants,
and ℓ over string constants.

Intuitively, these expressions represent the syntax supplied by 
the host language — though only the first two and one of the next 
two are really needed — that we extend with (the missing expres- 
sions and) the expression of filter application. Expressions are 
formed by constants, variables, pairs, records, and operations on
records: record concatenation gives priority to the expression on
the right, so if in r1 + r2 both records contain a field with the
same label, it is the one in r2 that will be taken, while field deletion
does not require the record to contain a field with the given label 
(though this point is not important). The metavariable op ranges 
over operators as well as functions and other constructions belong- 
ing to or defined by the host language. Among expressions we sin- 
gle out a set of values, intuitively the results of computations, that 
are formally defined as follows: 

v ::= c | (v,v) | {ℓ:v; ...; ℓ:v}
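
The two record operations can be illustrated with Python dicts. This is a minimal sketch of the behaviour described above (right-biased concatenation, deletion that tolerates a missing label), with records modelled as dicts; the function names concat and delete_field are hypothetical.

```python
def concat(r1: dict, r2: dict) -> dict:
    """Record concatenation r1 + r2: on a label clash the right record wins."""
    return {**r1, **r2}


def delete_field(r: dict, label: str) -> dict:
    """Field deletion: removes the label if present, otherwise changes nothing."""
    return {k: v for k, v in r.items() if k != label}
```

For instance, concat({"a": 1, "b": 2}, {"b": 3}) gives {"a": 1, "b": 3}, and deleting a label that is not in the record returns the record unchanged.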

We use "foo" for character string constants, 'c' for characters,
1 2 3 4 5 and so on for integers, and backquoted words, such as
`foo, for atoms (i.e., user-defined constants). We use three distinguished
atoms `nil, `true, and `false. Double quotes can be
omitted for strings that are labels of record fields: thus we write
{name:"John"} rather than {"name":"John"}. Sequences (aka
heterogeneous lists, ordered collections, arrays) are encoded à la
LISP, as nested pairs where the atom `nil denotes the empty list.
We use [e1 ... en] as syntactic sugar for (e1, (..., (en, `nil)...)).

2.2 Types 

Expressions, in particular filter applications, are typed by the fol- 
lowing set of types (typically only basic, product, recursive and 
— some form of — record types will be provided by the host lan- 
guage): 



Definition 2 (types).

Types t ::= b                        (basic types)
         |  v                        (singleton types)
         |  (t,t)                    (products)
         |  {ℓ:t, ..., ℓ:t}          (closed records)
         |  {ℓ:t, ..., ℓ:t, ..}      (open records)
         |  t|t                      (union types)
         |  t&t                      (intersection types)
         |  ¬t                       (negation type)
         |  empty                    (empty type)
         |  any                      (any type)
         |  μT.t                     (recursive types)
         |  T                        (recursion variable)
         |  Op(t,...,t)              (foreign type calls)

where every recursion is guarded, that is, every type variable is 
separated from its binder by at least one application of a type 
constructor (ie, products, records, or Op). 

Most of these types were already explained in the introduction. 
We have basic types (int, bool, ...) ranged over by b and singleton
types v denoting the type that contains only the value v.



Record types come in two flavors: closed record types whose val- 
ues are records with exactly the fields specified by the type, and 
open record types whose values are records with at least the fields 
specified by the type. Product types are standard and we have a 
complete set of type connectives, that is, finite unions, intersections 
and negations. We use empty to denote the type that has no values
and any for the type of all values (sometimes denoted by "_" when 
used in patterns). We added a term for recursive types, which al- 
lows us to encode both the regular expression types defined in the 
introduction and, more generally, the recursive type definitions we 
used there. Finally, we use Op (capitalized to distinguish it from 
expression operators) to denote the host language's type operators 
(if any). Thus, when filter applications return values whose type 
belongs just to the foreign language (eg, a list of functions) we sup- 
pose the typing of these functions be given by some type operators. 
For instance, if succ is a user defined successor function, we will 
suppose to be given its type in the form Arrow(int,int) and, simi- 
larly, for its application, say apply(succ,3) we will be given the type 
of this expression (presumably int). Here Arrow is a type operator 
and apply an expression operator. 

The denotational semantics of types as sets of values, that we 
informally described in the introduction, is at the basis of the definition
of the subtyping relation for these types. We say that a type
t1 is a subtype of a type t2, noted t1 ≤ t2, if and only if the set
of values denoted by t1 is contained (in the set-theoretic sense) in
the set of values denoted by t2. For the formal definition and the
decision procedure of this subtyping relation the reader can refer to 
the work on semantic subtyping flfll . 

2.3 Patterns 

Filters are our core untyped operators. All they can do are three 
different things: (1) they can structurally decompose and transform 
the values they are applied to, or (2) they can be sequentially 
composed, or (3) they can do pattern matching. In order to define 
filters, thus, we first need to define patterns. 

Definition 3 (patterns).

Patterns p ::= t                       (type)
            |  x                       (variable)
            |  (p,p)                   (pair)
            |  {ℓ:p, ..., ℓ:p}         (closed rec)
            |  {ℓ:p, ..., ℓ:p, ..}     (open rec)
            |  p|p                     (or/union)
            |  p&p                     (and/intersection)

where the subpatterns forming pairs, records, and intersections
have distinct capture variables, and those forming unions have the
same capture variables.

Patterns are essentially types in which capture variables (ranged 
over by x, y, ...) may occur in every position that is not under a
negation or a recursion. A pattern is used to match a value. The
matching of a value v against a pattern p, noted v/p, either fails
(noted Ω) or it returns a substitution from the variables occurring
in the pattern, into values. The substitution is then used as an 
environment in which some expression is evaluated. If the pattern is 
a type, then the matching fails if and only if the pattern is matched 
against a value that does not have that type, otherwise it returns 
the empty substitution. If it is a variable, then the matching always 
succeeds and returns the substitution that assigns the matched value 
to the variable. The pair pattern (p1,p2) succeeds if and only if it
is matched against a pair of values and each sub-pattern succeeds
on the corresponding projection of the value (the union of the two
substitutions is then returned). Both record patterns are similar to
the product pattern, with the specificity that in the open record
pattern ".." matches all the fields that are not specified in the
pattern. An intersection pattern p1&p2 succeeds if and only if
both patterns succeed (the union of the two substitutions is then
returned). The union pattern p1|p2 first tries to match the pattern p1
and, if it fails, tries the pattern p2.

For instance, the pattern (int&x, y) succeeds only if the
matched value is a pair of values (v1,v2) in which v1 is an integer,
in which case it returns the substitution {x/v1, y/v2},
and fails otherwise. Finally notice that the notation "p as x" we
used in the examples of the introduction is syntactic sugar for p&x.
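
The informal matching semantics above can be sketched as a small Python function. This is an illustration under simplifying assumptions, not the formal definition: type patterns are Python classes checked with isinstance, pair patterns are 2-tuples, record patterns are omitted, failure is represented by None, and the names Var, And, Or, FAIL, and match are hypothetical. Note that in Python bool is a subclass of int, so True would match the type pattern int.

```python
from dataclasses import dataclass

FAIL = None  # stand-in for matching failure (an assumption of this sketch)


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def _both(s1, s2):
    """Union of two substitutions, failing if either match failed."""
    if s1 is FAIL or s2 is FAIL:
        return FAIL
    return {**s1, **s2}


def match(value, pattern):
    """Return a substitution (a dict from variable names to values) or FAIL."""
    if isinstance(pattern, type):      # type pattern: empty substitution
        return {} if isinstance(value, pattern) else FAIL
    if isinstance(pattern, Var):       # variable pattern: always succeeds
        return {pattern.name: value}
    if isinstance(pattern, tuple):     # pair pattern (p1, p2)
        if not (isinstance(value, tuple) and len(value) == 2):
            return FAIL
        return _both(match(value[0], pattern[0]), match(value[1], pattern[1]))
    if isinstance(pattern, And):       # intersection: both must succeed
        return _both(match(value, pattern.left), match(value, pattern.right))
    if isinstance(pattern, Or):        # union: first alternative that succeeds
        s = match(value, pattern.left)
        return s if s is not FAIL else match(value, pattern.right)
    raise ValueError("unknown pattern")
```

With this encoding, the pattern (int&x, y) is written (And(int, Var("x")), Var("y")); matching it against (3, "a") yields {"x": 3, "y": "a"}, while matching it against ("b", "a") fails.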

This informal semantics of matching (see [16] for the formal
definition) explains the reasons for the restrictions on capture variables
in Definition 3: in intersections, pairs, and records all patterns
must be matched and, thus, they have to assign distinct variables, 
while in union patterns just one pattern will be matched, hence the 
same set of variables must be assigned, whichever alternative is se- 
lected. 

The strength of patterns is their connections with types and the 
fact that the pattern matching operator can be typed exactly. This is 
entailed by the following theorems (both proved in [16]):

Theorem 4 (Accepted type [13]). For every pattern p, the set of all
values v such that v/p ≠ Ω is a type. We call this set the accepted
type of p and note it by ⟅p⟆.

The fact that the exact set of values for which a matching succeeds
is a type is not obvious. It states that for every pattern p there exists
a syntactic type produced by the grammar in Definition 2 whose
semantics is exactly the set of all and only values that are matched
by p. The existence of this syntactic type, which we note ⟅p⟆, is
of utmost importance for a precise typing of pattern matching. In
particular, given a pattern p and a type t contained in (i.e., subtype of)
⟅p⟆, it allows us to compute the exact type of the capture variables
of p when it is matched against a value in t:

Theorem 5 (Type environment [13]). There exists an algorithm
that for every pattern p and every type t ≤ ⟅p⟆ returns a type environment
t/p ∈ Vars(p) → Types such that (t/p)(x) = {(v/p)(x) | v : t}.

2.4 Filters 

Definition 6 (filters). A filter is a term generated by:

Filters   f ::= e                       (expression)
             |  p => f                  (pattern)
             |  (f,f)                   (product)
             |  {ℓ:f, ..., ℓ:f, ..}     (record)
             |  f|f                     (union)
             |  μX.f                    (recursion)
             |  X a                     (recursive call)
             |  f;f                     (composition)
             |  o                       (declarative operators)

Operators o ::= groupby f               (filter grouping)
             |  orderby f               (filter ordering)

Arguments a ::= x                       (variables)
             |  c                       (constants)
             |  (a,a)                   (pairs)
             |  {ℓ:a, ..., ℓ:a}         (record)

such that for every subterm of the form f;g, no recursion variable
is free in f.

Filters are like transducers, that when applied to a value re- 
turn another value. However, unlike transducers they possess more 
"programming-oriented" constructs, like the ability to test an in- 
put and capture subterms, recompose an intermediary result from 
captured values and a composition operator. We first describe in- 
formally the semantics of each construct. 



The expression filter e always returns the value corresponding
to the evaluation of e (and discards its argument). The filter p => f
applies the filter f to its argument in the environment obtained by
matching the argument against p (provided that the matching does
not fail). This rather powerful feature allows a filter to perform two
critical actions: (i) inspect an input with regular pattern-matching
before exploring it and (ii) capture part of the input that can be
reused during the evaluation of the subfilter f. If the
application of fi to vi returns v'i, then the application of the product
filter (f1,f2) to an argument (v1,v2) returns (v'1,v'2); otherwise, if
any application fails or if the argument is not a pair, it fails. The
record filter is similar: it applies to each specified field the corresponding
filter and, as stressed by the "..", leaves the other fields
unchanged; it fails if any of the applications does, or if any of the
specified fields is absent, or if the argument is not a record. The filter
f1|f2 returns the application of f1 to its argument or, if this fails,
the application of f2. The semantics of a recursive filter is given by
standard unfolding of its definition in recursive calls. The only real 
restriction that we introduce for filters is that recursive calls can be 
done only on arguments of a given form (ie, on arguments that have 
the form of values where variables may occur). This restriction in 
practice amounts to forbid recursive calls on the result of another 
recursively defined filter (all other cases can be easily encoded). 
The reason of this restriction is technical, since it greatly simpli- 
fies the analysis of Section 1431 (which ensures the termination of 
type inference) without hampering expressiveness: filters are Tur- 
ing complete even with this restriction (see Theorem|7). Filters can 
be composed: the filter /i;/2 applies /2 to the result of applying 
/i to the argument and fails if any of the two does. The condition 
that in every subterm of the form f;g, f does not contain free re- 
cursion variables is not strictly necessary. Indeed, we could allow 
such terms. The point is that the analysis for the termination of the 
typing would then reject all such terms (apart from trivial ones in 
which the result of the recursive call is not used in the composition). 
But since this restriction does not restrict the expressiveness of the 
calculus (Theorem |7] proves Turing completeness with this restric- 
tion), then the addition of this restriction is just a design (rather 
than a technical) choice: we prefer to forbid the programmer to 
write recursive calls on the left-hand side of a composition, than 
systematically reject all the programs that use them in a non-trivial 
way. 

Finally, we singled out some specific filters (specifically, we
chose groupby and orderby) whose semantics is generally
specified in a declarative rather than operational way. These do not
bring any expressive power to the calculus (the proof of Turing
completeness, Theorem 7, does not use these declarative operators)
and actually they can be encoded by the remaining filters, but it
is interesting to single them out because they yield either simpler
encodings or more precise typing.

3. Semantics 

The operational semantics of our calculus is given by the reduction 
semantics for filter application and for the record operations. Since 
the former is the only novelty of our work, we save space and omit 
the latter, which are standard anyhow. 

3.1 Big step semantics 

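We define a big step operational semantics for filters. The definition
is given by the inference rules in Figure 1 for judgements of the
form $\delta;\gamma \vdash_{\mathit{eval}} f(a) \leadsto r$ and describes how the evaluation of
the application of filter $f$ to an argument $a$ in an environment $\gamma$
yields an object $r$ where $r$ is either a value or $\Omega$. The latter is
a special value which represents a runtime error: it is raised by
the rule (error) either because a filter did not match the form of
its argument (eg, the argument of a filter product was not a pair)
or because some pattern matching failed (ie, the side condition
of (patt) did not hold). Notice that the argument $a$ of a filter is
always a value $v$ unless the filter is the unfolding of a recursive
call, in which case variables may occur in it (cf. rule rec-call).
Environment $\delta$ is used to store the body of recursive definitions.

$$
\begin{array}{lll}
(\textit{expr}) &
\dfrac{}{\delta;\gamma \vdash_{\mathit{eval}} e(v) \leadsto r} &
r = \mathit{eval}(\gamma, e) \\[14pt]
(\textit{prod}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} f_1(v_1) \leadsto r_1 \qquad \delta;\gamma \vdash_{\mathit{eval}} f_2(v_2) \leadsto r_2}
      {\delta;\gamma \vdash_{\mathit{eval}} (f_1, f_2)(v_1, v_2) \leadsto (r_1, r_2)} &
\text{if } r_1 \neq \Omega \text{ and } r_2 \neq \Omega \\[14pt]
(\textit{patt}) &
\dfrac{\delta;\gamma, v/p \vdash_{\mathit{eval}} f(v) \leadsto r}
      {\delta;\gamma \vdash_{\mathit{eval}} (p \Rightarrow f)(v) \leadsto r} &
\text{if } v/p \neq \Omega \\[14pt]
(\textit{comp}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} f_1(v) \leadsto r_1 \qquad \delta;\gamma \vdash_{\mathit{eval}} f_2(r_1) \leadsto r_2}
      {\delta;\gamma \vdash_{\mathit{eval}} (f_1 ; f_2)(v) \leadsto r_2} &
\text{if } r_1 \neq \Omega \\[14pt]
(\textit{union1}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} f_1(v) \leadsto r_1}
      {\delta;\gamma \vdash_{\mathit{eval}} (f_1 \mid f_2)(v) \leadsto r_1} &
\text{if } r_1 \neq \Omega \\[14pt]
(\textit{union2}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} f_1(v) \leadsto \Omega \qquad \delta;\gamma \vdash_{\mathit{eval}} f_2(v) \leadsto r_2}
      {\delta;\gamma \vdash_{\mathit{eval}} (f_1 \mid f_2)(v) \leadsto r_2} & \\[14pt]
(\textit{rec}) &
\dfrac{\delta, (X \mapsto f);\gamma \vdash_{\mathit{eval}} f(v) \leadsto r}
      {\delta;\gamma \vdash_{\mathit{eval}} (\mu X.f)(v) \leadsto r} & \\[14pt]
(\textit{rec-call}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} (\delta(X))(a) \leadsto r}
      {\delta;\gamma \vdash_{\mathit{eval}} (X a)(v) \leadsto r} & \\[14pt]
(\textit{recd}) &
\dfrac{\delta;\gamma \vdash_{\mathit{eval}} f_1(v_1) \leadsto r_1 \quad \cdots \quad \delta;\gamma \vdash_{\mathit{eval}} f_n(v_n) \leadsto r_n}
      {\delta;\gamma \vdash_{\mathit{eval}} \{\ell_1{:}f_1, \ldots, \ell_n{:}f_n, ..\}(\{\ell_1{:}v_1, \ldots, \ell_n{:}v_n, \ldots, \ell_{n+k}{:}v_{n+k}\}) \leadsto \{\ell_1{:}r_1, \ldots, \ell_n{:}r_n, \ldots, \ell_{n+k}{:}v_{n+k}\}} &
\text{if } \forall i,\ r_i \neq \Omega \\[14pt]
(\textit{error}) &
\dfrac{}{\delta;\gamma \vdash_{\mathit{eval}} f(a) \leadsto \Omega} &
\text{if no other rule applies}
\end{array}
$$

Figure 1. Dynamic semantics of filters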

The semantics of filters is quite straightforward and inspired 
by the semantics of patterns. The expression filter discards its 
input and evaluates (rather, asks the host language to evaluate) the 
expression e in the current environment (expr). It can be thought 
of as the right-hand side of a branch in a match_with construct. 

The product filter expects a pair as input, applies its sub-filters
component-wise and returns the pair of the results (prod). This
filter is used in particular to express sequence mapping, as the first
component $f_1$ transforms the element of the list and $f_2$ is applied
to the tail. In practice it is often the case that $f_2$ is a recursive call
that iterates on arbitrary lists and stops when the input is `nil. If
the input is not a pair, then the filter fails (rule (error) applies).

The record filter expects as input a record value with at least the
same fields as those specified by the filter. It applies each sub-filter
to the value in the corresponding field leaving the contents of other
fields unchanged (recd). If the argument is not a record value or it
does not contain all the fields specified by the record filter, or if the
application of any subfilter fails, then the whole application of the
record filter fails.

The pattern filter matches its input value $v$ against the pattern $p$.
If the matching fails, so does the filter; otherwise it evaluates its
sub-filter in the environment augmented by the substitution $v/p$ (patt).

The alternative filter follows a standard first-match policy: if the
filter $f_1$ succeeds, then its result is returned (union1). If $f_1$ fails,
then $f_2$ is evaluated against the input value (union2). This filter is
particularly useful to write the alternative of two (or more) pattern
filters, making it possible to conditionally continue a computation
based on the shape of the input.

The composition allows us to pass the result of $f_1$ as input to $f_2$.
The composition filter is of paramount importance. Indeed, without
it, our only way to iterate over (deconstruct) an input value is to use a
product filter, which always rebuilds a pair as result.

Finally, a recursive filter is evaluated by recording its body in
$\delta$ and evaluating it (rec), while for a recursive call we replace the
recursion variable by its definition (rec-call).
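
The rules of Figure 1 can be read as a small interpreter. The Python sketch below is only an illustration under simplifying assumptions, not the authors' implementation: pairs are Python 2-tuples, the atom `nil is the string "`nil", patterns are limited to variables, constants, wildcards and pairs, host-language expressions are Python functions of the environment, and all names (`Fail`, `match`, `expr`, `pat`, `prod`, `union`, `comp`, `rec`, `call`) are hypothetical. Failure (the result $\Omega$) is modelled by an exception.

```python
class Fail(Exception):
    """Plays the role of the runtime error Omega."""

NIL = "`nil"

def match(p, v):
    """Match value v against pattern p; return the bindings or raise Fail."""
    kind = p[0]
    if kind == "any":
        return {}
    if kind == "var":
        return {p[1]: v}
    if kind == "const":
        if v == p[1]:
            return {}
        raise Fail
    if kind == "pair":
        if isinstance(v, tuple) and len(v) == 2:
            return {**match(p[1], v[0]), **match(p[2], v[1])}
        raise Fail
    raise ValueError(f"unknown pattern {p!r}")

# Every filter is a function (delta, gamma, v) -> result.
def expr(fn):                 # e: discard the input, evaluate e in gamma
    return lambda delta, gamma, v: fn(gamma)

def pat(p, f):                # p => f: extend gamma with the bindings v/p
    return lambda delta, gamma, v: f(delta, {**gamma, **match(p, v)}, v)

def prod(f1, f2):             # (f1, f2): component-wise on a pair
    def run(delta, gamma, v):
        if not (isinstance(v, tuple) and len(v) == 2):
            raise Fail
        return (f1(delta, gamma, v[0]), f2(delta, gamma, v[1]))
    return run

def union(f1, f2):            # f1 | f2: first-match policy
    def run(delta, gamma, v):
        try:
            return f1(delta, gamma, v)
        except Fail:
            return f2(delta, gamma, v)
    return run

def comp(f1, f2):             # f1 ; f2: feed the result of f1 to f2
    return lambda delta, gamma, v: f2(delta, gamma, f1(delta, gamma, v))

def rec(name, f):             # mu X. f: record the body in delta
    return lambda delta, gamma, v: f({**delta, name: f}, gamma, v)

def call(name, arg):          # X a: apply the recorded body to the argument a
    return lambda delta, gamma, v: delta[name](delta, gamma, arg(gamma))

# Sequence mapping: mu X. (`nil => `nil | (x, xs) => (x + 1, X xs))
map_succ = rec("X", union(
    pat(("const", NIL), expr(lambda g: NIL)),
    pat(("pair", ("var", "x"), ("var", "xs")),
        prod(expr(lambda g: g["x"] + 1), call("X", lambda g: g["xs"]))),
))
```

Applying `map_succ({}, {}, (1, (2, (3, NIL))))` walks the list with the product filter and the recursive call, while an input that is neither `nil nor a pair raises `Fail`, mirroring rule (error).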

This concludes the presentation of the semantics of non-declarative
filters (ie, without groupby and orderby). These form a
Turing complete formalism (full proof in Appendix B):



Theorem 7 (Turing completeness). The language formed by 
constants, variables, pairs, equality, and applications of non- 
declarative filters is Turing complete. 



Proof (sketch). We can encode untyped call-by-value $\lambda$-calculus
by first applying continuation passing style (CPS) transformations
and encoding CPS term reduction rules and substitutions via filters.
Thanks to CPS we eschew the restrictions on composition. □

3.2 Semantics of declarative filters 

To conclude the presentation of the semantics we have to define the 
semantics of groupby and orderby. We prefer to give the semantics 
in a declarative form rather than operationally in order not to tie it 
to a particular order (of keys or of the execution): 



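Groupby: groupby $f$ applied to a sequence $[v_1 \ldots v_m]$ reduces
to a sequence $[(k_1, l_1) \ldots (k_n, l_n)]$ such that:

1. $\forall i,\ 1 \leq i \leq m,\ \exists j,\ 1 \leq j \leq n$, s.t. $k_j = f(v_i)$

2. $\forall j,\ 1 \leq j \leq n,\ \exists i,\ 1 \leq i \leq m$, s.t. $k_j = f(v_i)$

3. $\forall j,\ 1 \leq j \leq n$, $l_j$ is a sequence: $[v^j_1 \ldots v^j_{n_j}]$

4. $\forall j,\ 1 \leq j \leq n,\ \forall k,\ 1 \leq k \leq n_j$, $f(v^j_k) = k_j$

5. $k_i = k_j \Rightarrow i = j$

6. $l_1, \ldots, l_n$ is a partition of $[v_1 \ldots v_m]$

Orderby: orderby $f$ applied to $[v_1 \ldots v_n]$ reduces to $[v'_1 \ldots v'_n]$
such that:

1. $[v'_1 \ldots v'_n]$ is a permutation of $[v_1 \ldots v_n]$,

2. $\forall i, j$ s.t. $1 \leq i \leq j \leq n$, $f(v'_i) \leq f(v'_j)$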
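
Since the definitions above are declarative, any function producing a result that satisfies the listed conditions is an admissible implementation. The following Python sketch shows one such choice (keys in order of first occurrence for groupby, a stable sort for orderby) together with a checker for the groupby conditions. It assumes the key function `f` is an ordinary Python function, that keys support `==` (and `<` for orderby), and that sequences are Python lists; the function names are hypothetical.

```python
def groupby(f, vs):
    """Return [(key, group), ...] with pairwise distinct keys; groups partition vs."""
    groups = []                          # keys only need to support ==
    for v in vs:
        k = f(v)
        for key, members in groups:
            if key == k:
                members.append(v)
                break
        else:                            # no existing group has this key
            groups.append((k, [v]))
    return groups

def orderby(f, vs):
    """Return a permutation of vs whose keys f(v) are in non-decreasing order."""
    return sorted(vs, key=f)

def satisfies_groupby(f, vs, result):
    """Check conditions 1, 2, 4, 5 and 6 of the declarative definition."""
    keys = [k for k, _ in result]
    covered = all(any(k == f(v) for k in keys) for v in vs)             # condition 1
    used = all(any(k == f(v) for v in vs) for k in keys)                # condition 2
    consistent = all(f(v) == k for k, group in result for v in group)   # condition 4
    distinct = all(keys[i] != keys[j]
                   for i in range(len(keys)) for j in range(i))         # condition 5
    flat = [v for _, group in result for v in group]
    partition = sorted(map(repr, flat)) == sorted(map(repr, vs))        # condition 6
    return covered and used and consistent and distinct and partition
```

Any other ordering of the groups, or of elements with equal keys, would satisfy the same conditions; this freedom is exactly what the declarative presentation preserves.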

Since the semantics of both operators is deeply connected to a 
notion of equality and order on values of the host language, we 
give them as "built-in" operations. However we will illustrate how 
our type algebra allows us to provide very precise typing rules, 
specialized for their particular semantics. It is also possible to 
encode co-grouping (or groupby on several input sequences) with
a combination of groupby and filters (cf. Appendix H).

3.3 Syntactic sugar 

Now that we have formally defined the semantics of filters we can
use them to introduce some syntactic sugar.

Expressions. The reader may have noticed that the productions
for expressions (Definition 1) do not define any destructor (eg,
projections, label selection, ...), just constructors. The reason is
that destructors, as well as other common expressions, can be
encoded by filter applications:



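$$
\begin{array}{rcl}
e.\ell & = & (\{\ell{:}x, ..\} \Rightarrow x)\,e \\
\texttt{fst}(e) & = & ((x, \texttt{any}) \Rightarrow x)\,e \\
\texttt{snd}(e) & = & ((\texttt{any}, x) \Rightarrow x)\,e \\
\texttt{let}\ p = e_1\ \texttt{in}\ e_2 & = & (p \Rightarrow e_2)\,e_1 \\
\texttt{if}\ e\ \texttt{then}\ e_1\ \texttt{else}\ e_2 & = & (\text{`true} \Rightarrow e_1 \mid \text{`false} \Rightarrow e_2)\,e \\
\texttt{match}\ e\ \texttt{with}\ p_1 \Rightarrow e_1 \mid \ldots \mid p_n \Rightarrow e_n & = & (p_1 \Rightarrow e_1 \mid \ldots \mid p_n \Rightarrow e_n)\,e
\end{array}
$$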

This is just one possible choice; others are possible. For instance,
in Jaql dot selection is overloaded: when $\_.\ell$ is applied to
a record, Jaql returns the content of its $\ell$ field; if the field is
absent or the argument is null, then Jaql returns null and fails if
the argument is not a record; when applied to a list ('array' in Jaql
terminology) it recursively applies to all the elements of the list. So
Jaql's "$\_.\ell$" is precisely defined as

$$\mu X.(\{\ell{:}x, ..\} \Rightarrow x \;\mid\; (\{..\} \mid \texttt{null}) \Rightarrow \texttt{null} \;\mid\; (h, t) \Rightarrow (X h, X t))$$
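
To make the overloading concrete, here is a minimal Python sketch of the behaviour described in the prose above. It is an illustration only: Python dictionaries, `None` and lists stand in for Jaql records, null and arrays, and the function name `dot` is hypothetical. It follows the prose description rather than transcribing the filter literally.

```python
def dot(value, label):
    """Overloaded field selection in the style described for Jaql."""
    if value is None:                 # null argument: the result is null
        return None
    if isinstance(value, dict):       # record: field content, or null if absent
        return value.get(label)
    if isinstance(value, list):       # array: apply recursively to every element
        return [dot(item, label) for item in value]
    raise TypeError(f"cannot select field {label!r} from {value!r}")
```

For example, `dot([{"a": 1}, {"b": 2}], "a")` selects field `a` in each element, yielding a null for the second record because the field is absent there.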

Besides the syntactic sugar above, in the next section we will use
$t_1 + t_2$ to denote the record type formed by all field types in $t_2$
and all the field types in $t_1$ whose label is not already present in $t_2$.
Similarly $t \setminus \ell$ will denote the record types formed by all field types
in $t$ apart from the one labelled by $\ell$, if present. Finally, we will also
use for expressions, types, and patterns the syntactic sugar for lists
used in the introduction. So, for instance, $[p_1\ p_2 \ldots p_n]$ is matched
by lists of $n$ elements provided that their $i$-th element matches $p_i$.

4. Type inference 

In this section we describe a type inference algorithm for our 
expressions. 

4.1 Typing of simple and foreign expressions 

Variables, constants, and pairs are straightforwardly typed by

$$
[\textsc{Vars}]\ \frac{}{\Gamma \vdash x : \Gamma(x)}
\qquad
[\textsc{Constant}]\ \frac{}{\Gamma \vdash c : c}
\qquad
[\textsc{Prod}]\ \frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash e_2 : t_2}{\Gamma \vdash (e_1, e_2) : (t_1, t_2)}
$$

where $\Gamma$ denotes a typing environment, that is, a function from
expression variables to types, and $c$ denotes both a constant and the
singleton type containing the constant. Expressions of the host
language are typed by the function type which, given a type
environment and a foreign expression, returns the type of the expression,
and that we suppose to be given for each host language.¹

$$
[\textsc{Foreign}]\ \frac{\Gamma \vdash e_1 : t_1 \quad \cdots \quad \Gamma \vdash e_n : t_n}
{\Gamma \vdash \mathit{op}(e_1, \ldots, e_n) : \mathsf{type}((\Gamma, x_1{:}t_1, \ldots, x_n{:}t_n),\ \mathit{op}(x_1, \ldots, x_n))}
$$

Since the various $e_i$ can contain filter applications, thus unknown
to the host language's type system, the rule [Foreign] swaps them
with variables having the same type.

Notice that our expressions, whereas they include filter
applications, do not include applications of expressions to expressions.
Therefore if the host language provides function definitions, then
the applications of the host language must be dealt with as foreign
expressions as well (cf. the expression operator apply in Section 2.2).

¹ The function type must be able to handle type environments with types
of our system. It can do so either by subsuming variables with specific types
to the types of the host language (eg, if the host language does not support
singleton types then the singleton type 3 will be subsumed to int) or by
typing foreign expressions by using our types.

4.2 Typing of records

The typing of records is novel and challenging because record
expressions may contain string expressions in label position, such as
in $\{e_1{:}e_2\}$, while in all type systems for records we are aware of,
labels are never computed. It is difficult to give a type to $\{e_1{:}e_2\}$
since, in general, we do not statically know the value that $e_1$ will
return, and which is required to form a record type. All we can
(and must) ask is that this value will be a string. To type a record
expression $\{e_1{:}e_2\}$, thus, we distinguish two cases according to
whether the type $t_1$ of $e_1$ is finite (ie, it contains only finitely many
values, such as, say, Bool) or not. If a type is finite (finiteness of
regular types seen as tree automata can be decided in polynomial
time lldl ), then it is possible to write it as a finite union of values
(actually, of singleton types). So consider again $\{e_1{:}e_2\}$ and let $t_1$
be the type of $e_1$ and $t_2$ the type of $e_2$. First, $t_1$ must be a
subtype of string (since record labels are strings). So if $t_1$ is finite
it can be expressed as $\ell_1 \mid \cdots \mid \ell_n$, which means that $e_1$ will return
the string $\ell_i$ for some $i \in [1..n]$. Therefore $\{e_1{:}e_2\}$ will have type
$\{\ell_i{:}t_2\}$ for some $i \in [1..n]$ and, thus, the union of all these types,
as expressed by the rule [Rcd-Fin] below. If $t_1$ is infinite instead,
then all we can say is that it will be a record with some (unknown)
labels, as expressed by rule [Rcd-Inf].

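$$
[\textsc{Rcd-Fin}]\ \frac{\Gamma \vdash e : \ell_1 \mid \cdots \mid \ell_n \qquad \Gamma \vdash e' : t}
{\Gamma \vdash \{e{:}e'\} : \{\ell_1{:}t\} \mid \cdots \mid \{\ell_n{:}t\}}
\qquad
[\textsc{Rcd-Inf}]\ \frac{\Gamma \vdash e : t \qquad \Gamma \vdash e' : t'}
{\Gamma \vdash \{e{:}e'\} : \{..\}}\ \ 
\begin{array}{l} t \leq \texttt{string} \\ t \text{ is infinite} \end{array}
$$

$$
[\textsc{Rcd-Mul}]\ \frac{\Gamma \vdash \{e_1{:}e'_1\} : t_1 \quad \cdots \quad \Gamma \vdash \{e_n{:}e'_n\} : t_n}
{\Gamma \vdash \{e_1{:}e'_1, \ldots, e_n{:}e'_n\} : t_1 + \cdots + t_n}
$$

$$
[\textsc{Rcd-Conc}]\ \frac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash e_2 : t_2}
{\Gamma \vdash e_1 + e_2 : t_1 + t_2}\ \ 
\begin{array}{l} t_1 \leq \{..\} \\ t_2 \leq \{..\} \end{array}
\qquad
[\textsc{Rcd-Del}]\ \frac{\Gamma \vdash e : t}
{\Gamma \vdash e \setminus \ell : t \setminus \ell}\ \ t \leq \{..\}
$$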
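
The case distinction between [Rcd-Fin] and [Rcd-Inf] can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the paper's algorithm: a finite label type is a `frozenset` of string literals (a finite union of singleton types), the infinite type string is the marker `STRING`, value types are opaque names, a union of record types is a list of alternatives, and all identifiers are hypothetical.

```python
STRING = "string"        # marker for the infinite type of all strings
OPEN_RECORD = "{..}"     # the type of all records

def type_single_field(label_type, value_type):
    """Type the record expression {e: e'} given the types of e and e'."""
    if label_type == STRING:
        # [Rcd-Inf]: the label is some unknown string, so only {..} is known.
        return [OPEN_RECORD]
    if not all(isinstance(label, str) for label in label_type):
        raise TypeError("the label type must be a subtype of string")
    # [Rcd-Fin]: one record type per possible label; the list is their union.
    return [{label: value_type} for label in sorted(label_type)]
```

With a label of type `"a" | "b"` and a value of type `int`, the result is the union `{a:int} | {b:int}`; with a label of type `string`, only `{..}` can be inferred.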

Records with multiple fields are handled by the rule [Rcd-Mul]
which "merges" the result of typing single fields by using the type
operator + as defined in CDuce [4, 15], which is a right-priority
record concatenation defined to take into account undefined and
unknown fields: for instance, {a:int, b:int} + {a?:bool} =
{a:int|bool, b:int}; unknown fields in the right-hand side may
override known fields of the left-hand side, which is why, for
instance, we have {a:int, b:bool} + {b:int,..} = {b:int,..};
likewise, for every record type $t$ (ie, for every $t$ subtype of {..})
we have $t$ + {..} = {..}. Finally, [Rcd-Conc] and [Rcd-Del]
deal with record concatenation and field deletion, respectively, in
a straightforward way: the only constraint is that all expressions
must have a record type (ie, the constraints of the form ... $\leq$ {..}).
See Appendix G for formal definitions of all these type operators.
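
The three equations above can be reproduced with a small Python model of right-priority concatenation. This is a sketch under simplifying assumptions and not the formal definition of Appendix G: a field type is a set of base type names (read as a union) plus an optional flag, a record type is a dictionary of known fields plus an `open` flag standing for "..", and the names `Field`, `RecType`, `F` and `concat` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    types: frozenset          # union of base type names, e.g. {"int", "bool"}
    optional: bool = False    # True for a field written l?:t

@dataclass(frozen=True)
class RecType:
    fields: dict              # label -> Field (the known fields)
    open: bool = False        # True if the type ends with ".."

def F(*types, optional=False):
    return Field(frozenset(types), optional)

def concat(t1, t2):
    """Right-priority concatenation t1 + t2 in this toy model."""
    out = {}
    for label, f2 in t2.fields.items():
        if not f2.optional:
            out[label] = f2                      # the right field always wins
        elif label in t1.fields:
            f1 = t1.fields[label]                # right field may be absent:
            out[label] = Field(f1.types | f2.types, f1.optional)
        elif not t1.open:
            out[label] = f2                      # only the right side can define it
        # otherwise t1 is open and says nothing about the label: leave it unknown
    if not t2.open:
        for label, f1 in t1.fields.items():
            out.setdefault(label, f1)            # left fields survive only if t2 is closed
    return RecType(out, t1.open or t2.open)
```

The choice to drop left-hand fields whenever the right-hand side is open reflects the remark that unknown fields on the right may override known fields on the left.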

Notice that these rules do not ensure that a record will not have
two fields with the same label, which is a run-time error. Detecting
such an error needs sophisticated type systems (eg, dependent
types) beyond the scope of this work. This is why in the rule
[Rcd-Mul] we used the type operator "+" which, in case of multiple
occurrences of a label, since records are unordered, corresponds to randomly
choosing one of the types bound to these labels: if such a field is
selected, it would yield a run-time error, so its typing can be
ambiguous. We can fine tune the rule [Rcd-Mul] so that when all the
$t_i$ are finite unions of record types, then we require them to have pairwise
disjoint sets of labels; but since the problem would still persist for
infinite types we prefer to retain the current, simpler formulation.

4.3 Typing of filter application 

Filters are not first-class: they can be applied but not passed around 
or computed. Therefore we do not assign types to filters but, as for 
any other expression, we assign types to filter applications. The 
typing rule for filter application

$$
[\textsc{Filter-App}]\ \frac{\Gamma \vdash e : t \qquad \Gamma\,;\varnothing\,;\varnothing \vdash_{\mathit{fil}} f(t) : s}
{\Gamma \vdash f e : s}
$$






relies on an auxiliary deduction system for judgments of the form
$\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f(t) : s$ that states that if in the environments
$\Gamma$, $\Delta$, $M$ (explained later on) we apply the filter $f$ to a value of
type $t$, then it will return a result of type $s$.

To define this auxiliary deduction system, which is the core of
our type analysis, we first need to define $\lbag f \rbag$, the type accepted by
a filter $f$. Intuitively, this type gives a necessary condition on the
input for the filter not to fail:

Definition 8 (Accepted type). Given a filter $f$, the accepted type
of $f$, written $\lbag f \rbag$, is the set of values defined by:

$$
\begin{array}{rcl}
\lbag e \rbag & = & \texttt{any} \\
\lbag X a \rbag & = & \texttt{any} \\
\lbag \texttt{groupby}\ f \rbag & = & [\texttt{any}*] \\
\lbag \texttt{orderby}\ f \rbag & = & [\texttt{any}*]
\end{array}
$$

It is easy to show that an argument included in the accepted type is
a necessary (but not sufficient, because of the cases for composition
and recursion) condition for the evaluation of a filter not to fail:

Lemma 9. Let $f$ be a filter and $v$ be a value such that $v \notin \lbag f \rbag$.
For every $\gamma$, $\delta$, if $\delta;\gamma \vdash_{\mathit{eval}} f(v) \leadsto r$, then $r = \Omega$.

The proof is a straightforward induction on the structure of the
derivation, and is detailed in Appendix C. The last two auxiliary
definitions we need are related to product and record types. In the
presence of unions, the most general form for a product type is a
finite union of products (since intersections distribute on products).
For instance consider the type

(int,int) I (string, string) 
This type denotes the set of pairs for which either both projections 
are int or both projections are string. A type such as 

(int I string, int I string) 
is less precise, since it also allows pairs whose first projection is an 
int and second projection is a string and vice versa. We see that 
it is necessary to manipulate finite unions of products (and similarly 
for records), and therefore, we introduce the following notations: 

Lemma 10 (Product decomposition). Let $t \in \mathit{Types}$ such that
$t \leq (\texttt{any}, \texttt{any})$. A product decomposition of $t$, denoted by $\pi(t)$, is a
set of types

$$\pi(t) = \{(t^1_1, t^1_2), \ldots, (t^n_1, t^n_2)\}$$

such that $t = \bigvee_{t_i \in \pi(t)} t_i$. For a given product decomposition, we
say that $n$ is the rank of $t$, noted $\mathit{rank}(t)$, and use the notation
$\pi^i_j(t)$ for the type $t^i_j$.
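
The gain in precision of a union of products over a product of unions can be checked by brute force on tiny finite stand-ins. In the Python sketch below, which is only an illustration, a type is a finite set of values, a decomposition is a list of pairs of such sets, and `denote` (a hypothetical helper) computes the set of pairs it denotes; `INT` and `STR` are deliberately small samples of int and string.

```python
from itertools import product

def denote(decomposition):
    """Set of pairs denoted by a union of products [(A1, B1), ..., (An, Bn)]."""
    return {pair for a, b in decomposition for pair in product(a, b)}

INT = {0, 1}           # tiny stand-in for int
STR = {"a", "b"}       # tiny stand-in for string

# (int,int) | (string,string): a decomposition of rank 2
precise = denote([(INT, INT), (STR, STR)])

# (int|string, int|string): a decomposition of rank 1
coarse = denote([(INT | STR, INT | STR)])

# Pairs admitted only by the coarser type, e.g. (0, "a").
mixed = coarse - precise
```

Every pair of `precise` belongs to `coarse`, but `mixed` is not empty: it contains exactly the pairs whose projections have different types.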

There exist several suitable decompositions whose details are
out of the scope of this paper. We refer the interested reader to fT^
and [23] for practical algorithms that compute such decompositions
for any subtype of (any, any) or of {..}. These notions of
decomposition, rank and projection can be generalized to records:

Lemma 11 (Record decomposition). Let $t \in \mathit{Types}$ such that
$t \leq \{..\}$. A record decomposition of $t$, denoted by $\rho(t)$, is a finite
set of types $\rho(t) = \{r_1, \ldots, r_n\}$ where each $r_i$ is either of the form
$\{\ell^i_1{:}t^i_1, \ldots, \ell^i_{n_i}{:}t^i_{n_i}\}$ or of the form $\{\ell^i_1{:}t^i_1, \ldots, \ell^i_{n_i}{:}t^i_{n_i}, ..\}$ and
such that $t = \bigvee_{r_i \in \rho(t)} r_i$. For a given record decomposition, we
say that $n$ is the rank of $t$, noted $\mathit{rank}(t)$, and use the notation $\rho^j_\ell(t)$
for the type of label $\ell$ in the $j$-th component of $\rho(t)$.

In our calculus we have three different sets of variables. The
set $\mathit{Vars}$ of term variables, ranged over by $x, y, \ldots$, introduced
in patterns and used in expressions and in arguments of calls of
recursive filters. The set $\mathit{RVars}$ of term recursion variables, ranged
over by $X, Y, \ldots$ and that are used to define recursive filters. The
set $\mathit{TVars}$ of type recursion variables, ranged over by $T, U, \ldots$ used
to define recursive types. In order to use them we need to define
three different environments: $\Gamma : \mathit{Vars} \to \mathit{Types}$ denoting type
environments that associate term variables with their types; $\Delta :
\mathit{RVars} \to \mathit{Filters}$ denoting definition environments that associate
each filter recursion variable with the body of its definition; $M :
\mathit{RVars} \times \mathit{Types} \to \mathit{TVars}$ denoting memoization environments
which record that the call of a given recursive filter on a given
type yielded the introduction of a fresh recursion type variable.
Our typing rules thus work on judgments of the form $\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}}
f(t) : t'$ stating that applying $f$ to an expression of type $t$ in the
environments $\Gamma$, $\Delta$, $M$ yields a result of type $t'$. This judgment
can be derived with the set of rules given in Figure 2.

These rules are straightforward when put side by side with the
dynamic semantics of filters given in Section 3. It is clear that this
type system simulates at the level of types the computations that are
carried out by filters on values at runtime. For instance, rule
[Fil-Expr] calls the typing function of the host language to determine
the type of an expression $e$. Rule [Fil-Prod] applies a product filter
recursively on the first and second projection for each member of
the product decomposition of the input type and returns the union
of all result types. Rule [Fil-Rec] for records is similar, recursively
applying sub-filters label-wise for each member of the record
decomposition and returning the union of the resulting record types.
As for the pattern filter (rule [Fil-Pat]), its subfilter $f$ is typed in
the environment augmented by the mapping $t/p$ of the input type
against the pattern (cf. Theorem 5). The typing rule for the union
filter, [Fil-Union], reflects the first match policy: when typing the
second branch, we know that the first was not taken, hence that at
runtime the filtered value will have a type that is in $t$ but not in $\lbag f_1 \rbag$.
Notice that this is not ensured by the definition of accepted type
(which is a rough approximation that discards grosser errors but,
as we stressed right after its definition, is not sufficient to ensure
that evaluation of $f_1$ will not fail) but by the type system itself:
the premises check that $f_1(t_1)$ is well-typed which, by induction,
implies that $f_1$ will never fail on values of type $t_1$ and, ergo, that
these values will never reach $f_2$. Also, we discard from the output
type the contribution of the branches that cannot be taken, that is,
branches whose accepted type has an empty intersection with the
input type $t$. Composition (rule [Fil-Comp]) is straightforward. In
this rule, the restriction that $f_1$ is a filter with no open recursion
variable ensures that its output type $s$ is also a type without free
recursion variables and, therefore, that we can use it as input type
for $f_2$. The next three rules work together. The first, [Fil-Fix],
introduces for a recursive filter a fresh recursion variable for its output
type. It also memoizes in $\Delta$ that the recursive filter $X$ is associated
with a body $f$ and in $M$ that for an input filter $X$ and an input type
$t$, the output type is the newly introduced recursive type variable.
When dealing with a recursive call $X$ two situations may arise.
One possibility is that it is the first time the filter $X$ is applied to
the input type $t$. We therefore introduce a fresh type variable $T$
and recurse, replacing $X$ by its definition $f$. Otherwise, if the input
type has already been encountered while typing the filter variable
$X$, we can return its memoized type, a type variable $T$. Finally,
Rule [Fil-OrdBy] and Rule [Fil-GrpBy] handle the special cases
of orderby and groupby filters. Their typing is explained in the
following section.
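
The treatment of the first-match policy in [Fil-Union] can be illustrated on a toy model in which a type is a finite set of values. The Python sketch below makes several assumptions that are not in the paper: each branch is given as a pair of its accepted type and a function computing its output type from an input type, intersection and difference are plain set operations, and the names `type_union`, `succ` and `other` are hypothetical.

```python
def type_union(f1, f2, t):
    """Output type of (f1 | f2) applied to the input type t (toy version)."""
    accepted1, infer1 = f1
    accepted2, infer2 = f2
    if not t <= (accepted1 | accepted2):
        raise TypeError("input type not covered by the accepted types")
    t1 = t & accepted1          # values handled by the first branch
    t2 = t - accepted1          # values that are known to skip the first branch
    out = frozenset()
    if t1:                      # a branch with an empty input contributes nothing
        out |= infer1(t1)
    if t2:
        out |= infer2(t2)
    return out

INTS = frozenset({1, 2, 3})
STRS = frozenset({"a", "b"})

succ = (INTS, lambda t: frozenset(x + 1 for x in t))       # accepts integers only
other = (INTS | STRS, lambda t: frozenset({"other"}))      # accepts everything here
```

On the input type `{1, 2}` the second branch is dead, so its output `"other"` does not pollute the result; on `{1, "a"}` both branches contribute.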



4.4 Typing of orderby and groupby 

While the "structural" filters enjoy simple, compositional typing 
rules, the ad-hoc operations orderby and groupby need specially 
crafted rules. Indeed it is well known that when transformation 
languages have the ability to compare data values type-checking 
(and also type inference) becomes undecidable (eg, see ^ 0]). 
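$$
[\textsc{Fil-Expr}]\ \frac{}{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} e(t) : \mathsf{type}(\Gamma, e)}
\qquad
[\textsc{Fil-Pat}]\ \frac{\Gamma \cup t/p\,;\Delta\,;M \vdash_{\mathit{fil}} f(t) : s}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (p \Rightarrow f)(t) : s}\ \ t \leq \lbag p \rbag \,\&\, \lbag f \rbag
$$

$$
[\textsc{Fil-Prod}]\ \frac{i = 1..\mathit{rank}(t),\ j = 1,2 \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f_j(\pi^i_j(t)) : s^i_j}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (f_1, f_2)(t) : \bigvee_{i = 1..\mathit{rank}(t)} (s^i_1, s^i_2)}
$$

$$
[\textsc{Fil-Rec}]\ \frac{i = 1..\mathit{rank}(t),\ j = 1..m \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f_j(\rho^i_{\ell_j}(t)) : s^i_j}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} \{\ell_1{:}f_1, \ldots, \ell_m{:}f_m, ..\}(t) : \bigvee_{i = 1..\mathit{rank}(t)} \{\ell_1{:}s^i_1, \ldots, \ell_m{:}s^i_m, ..\}}
$$

$$
[\textsc{Fil-Union}]\ \frac{i = 1,2 \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f_i(t_i) : s_i}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (f_1 \mid f_2)(t) : \bigvee_{\{i \,\mid\, t_i \neq \texttt{empty}\}} s_i}\ \ 
\begin{array}{l} t \leq \lbag f_1 \rbag \mid \lbag f_2 \rbag \\ t_1 = t \,\&\, \lbag f_1 \rbag \\ t_2 = t \,\&\, \neg\lbag f_1 \rbag \end{array}
$$

$$
[\textsc{Fil-Comp}]\ \frac{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f_1(t) : s \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f_2(s) : s'}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (f_1 ; f_2)(t) : s'}
\qquad
[\textsc{Fil-Fix}]\ \frac{\Gamma\,;\Delta, (X \mapsto f)\,;M, ((X, t) \mapsto T) \vdash_{\mathit{fil}} f(t) : s}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (\mu X.f)(t) : \mu T.s}\ \ T \text{ fresh}
$$

$$
[\textsc{Fil-Call-New}]\ \frac{\Gamma\,;\Delta\,;M, ((X, t) \mapsto T) \vdash_{\mathit{fil}} \Delta(X)(t) : t'}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (X a)(s) : \mu T.t'}\ \ 
\begin{array}{l} t = \mathsf{type}(\Gamma, a) \\ (X, t) \notin \mathit{dom}(M) \\ T \text{ fresh} \end{array}
$$

$$
[\textsc{Fil-Call-Mem}]\ \frac{}{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (X a)(s) : M(X, t)}\ \ 
\begin{array}{l} t = \mathsf{type}(\Gamma, a) \\ (X, t) \in \mathit{dom}(M) \end{array}
$$

$$
[\textsc{Fil-OrdBy}]\ \frac{\forall t_i \in \mathit{item}(t) \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f(t_i) : s_i}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (\texttt{orderby}\ f)(t) : \mathit{OrderBy}(t)}\ \ 
\begin{array}{l} t \leq [\texttt{any}*] \\ \bigvee_i s_i \text{ is ordered} \end{array}
$$

$$
[\textsc{Fil-GrpBy}]\ \frac{\forall t_i \in \mathit{item}(t) \qquad \Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f(t_i) : s_i}
{\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} (\texttt{groupby}\ f)(t) : [((\bigvee_i s_i), \mathit{OrderBy}(t))*]}\ \ t \leq [\texttt{any}*]
$$

Figure 2. Type inference algorithm for filter application

We therefore provide two typing approximations that yield a good
compromise between precision and decidability. First we define an
auxiliary function over sequence types: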

Definition 12 (Item set). Let $t \in \mathit{Types}$ such that $t \leq [\texttt{any}*]$.
The item set of $t$, denoted by $\mathit{item}(t)$, is defined by:

$$
\begin{array}{rcll}
\mathit{item}(\texttt{empty}) & = & \varnothing & \\
\mathit{item}(t) & = & \mathit{item}(t \,\&\, (\texttt{any}, \texttt{any})) & \text{if } t \not\leq (\texttt{any}, \texttt{any}) \\
\mathit{item}\Big(\bigvee_{1 \leq i \leq \mathit{rank}(t)} (t^i_1, t^i_2)\Big) & = & \bigcup_{1 \leq i \leq \mathit{rank}(t)} \big(\{t^i_1\} \cup \mathit{item}(t^i_2)\big) &
\end{array}
$$

The first and second lines in the definition ensure that item() returns
the empty set for sequence types that are not products, namely for
the empty sequence. The third line handles the case of a non-empty
sequence type. In this case $t$ is a finite union of products, whose
first components are the types of the "head" of the sequence and
second components are recursively the types of the tails. Note also
that this definition is well-founded. Since types are regular trees the
number of distinct types accumulated by item() is finite. We can
now define typing rules for the orderby and groupby operators.

orderby $f$: The orderby filter uses its argument filter $f$ to
compute a key from each element of the input sequence and then
returns the same sequence of elements, sorted with respect to their
key. Therefore, while the types of the elements in the result are still
known, their order is lost. We use item() to compute the output
type of an orderby application:

$$\mathit{OrderBy}(t) = \Big[\Big(\bigvee_{t_i \in \mathit{item}(t)} t_i\Big)*\Big]$$
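
The well-foundedness argument for item() can be made concrete with a toy representation of regular sequence types. In the Python sketch below, which is an assumption-laden illustration rather than the paper's algorithm, a sequence type is a system of named equations: each name maps to a list of alternatives, and an alternative is either `"nil"` or a pair `(head_type, tail_name)`. Element types are opaque names and the functions `item` and `order_by` are hypothetical.

```python
def item(defs, start):
    """Collect the head types reachable from `start` in the equations `defs`."""
    seen, items, todo = set(), set(), [start]
    while todo:
        name = todo.pop()
        if name in seen:             # regularity: finitely many distinct tails
            continue
        seen.add(name)
        for alt in defs[name]:
            if alt == "nil":         # the empty sequence contributes no item
                continue
            head, tail = alt
            items.add(head)
            todo.append(tail)
    return items

def order_by(defs, start):
    """Render OrderBy(t) = [(t_1 | ... | t_k)*] as a string."""
    body = "|".join(sorted(item(defs, start))) or "empty"
    return "[(" + body + ")*]"

# [int+ bool+] as equations:
#   T0 = (int, T0) | (int, T1)    T1 = (bool, T1) | (bool, T2)    T2 = nil
int_then_bool = {
    "T0": [("int", "T0"), ("int", "T1")],
    "T1": [("bool", "T1"), ("bool", "T2")],
    "T2": ["nil"],
}
```

The `seen` set plays the role of the finiteness argument: each named tail is visited at most once, so the traversal stops even though the type is recursive. For `[int+ bool+]` the order information is lost and only the union of item types survives.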

groupby $f$: The typing of orderby can be used to give a rough
approximation of the typing of groupby as stated by rule
[Fil-GrpBy]. In words, we obtain a list of pairs where the key
component is the result type of $f$ applied to the items of the sequence,
and use OrderBy to shuffle the order of the list. A far more
precise typing of groupby that keeps track of the relation between list
elements and their images via $f$ is given in Appendix E.

4.5 Soundness, termination, and complexity 

The soundness of the type inference system is given by the property
of subject reduction for filter application

Theorem 13 (subject reduction). If $\varnothing\,;\varnothing\,;\varnothing \vdash_{\mathit{fil}} f(t) : s$, then
for all $v : t$, $\varnothing;\varnothing \vdash_{\mathit{eval}} f(v) \leadsto r$ implies $r : s$.



whose full proof is given in Appendix C. It is easy to write a filter
for which the type inference algorithm, that is the deduction
of $\vdash_{\mathit{fil}}$, does not terminate: $\mu X.\, x \Rightarrow X(x, x)$. The deduction of
$\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f(t) : s$ simulates an (abstract) execution of the
filter $f$ on the type $t$. Since filters are Turing complete, in
general it is not possible to decide whether the deduction of $\vdash_{\mathit{fil}}$
for a given filter $f$ will terminate for every input type $t$. For this
reason we define a static analysis $\mathit{Check}(f)$ for filters that ensures
that if $f$ passes the analysis, then for every input type $t$ the
deduction of $\Gamma\,;\Delta\,;M \vdash_{\mathit{fil}} f(t) : s$ terminates. For space reasons the
formal definition of $\mathit{Check}(f)$ is relegated to Appendix A, but its
behavior can be easily explained. Imagine that a recursive filter $f$
is applied to some input type $t$. The algorithm tracks all the
recursive calls occurring in $f$; next it performs one step of reduction of
each recursive call by unfolding the body; finally it checks in this
unfolding that if a variable occurs in the argument of a recursive
call, then it is bound to a type that is a subtree of the original type
$t$. In other words, the analysis verifies that in the execution of the
derivation for $f(t)$ every call to $s/p$ for some type $s$ and pattern
$p$ always yields a type environment where variables used in
recursive calls are bound to subtrees of $t$. This implies that the rule
[Fil-Call-New] will always memoize for a given $X$ types that are
obtained from the arguments of the recursive calls of $X$ by
replacing their variables with a subtree of the original type $t$ memoized
by the rule [Fil-Fix]. Since $t$ is regular, it has finitely many
distinct subtrees, thus [Fil-Call-New] can memoize only finitely
many distinct types, and therefore the algorithm terminates.
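
The role of memoization in this argument can be seen on a deliberately simplified simulation. The Python sketch below is an assumption, not the analysis of Appendix A: it only follows the chain of argument types of a single recursive call, with `next_arg` (hypothetical) mapping the type of the current argument to the type of the next one, and it gives up after a fixed amount of fuel.

```python
def count_call_types(next_arg, start, fuel=10):
    """Follow the argument types of successive recursive calls, memoizing each.

    Returns the number of memoized types once a memoized type is met again,
    or None if `fuel` calls were simulated without any memo hit.
    """
    memo = set()
    t = start
    for _ in range(fuel):
        if t in memo:                 # a [Fil-Call-Mem]-style hit: inference stops
            return len(memo)
        memo.add(t)                   # a [Fil-Call-New]-style step: memoize, unfold
        t = next_arg(t)
    return None

# mu X. x => X(x, x): every call builds a strictly larger type, so no hit occurs.
diverging = count_call_types(lambda t: (t, t), "int")

# A traversal of [int bool*], with T0 = (int, T1) and T1 = nil | (bool, T1):
# the argument of each call is the tail, a subtree of the input type.
tails = {"T0": "T1", "T1": "T1"}
converging = count_call_types(tails.get, "T0")
```

In the second case the arguments range over the finitely many subtrees of the regular input type, so a memoized type is reached after two steps; in the first case the fuel runs out because every argument is new.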

More precisely, the analysis proceeds in two passes. In the first
pass the algorithm tracks all recursive filters and for each of them it
(i) marks the variables that occur in the arguments of its recursive
calls, (ii) assigns to each variable an abstract identifier representing
the subtree of the input type to which the variable will be bound
at the initial call of the filter, and (iii) returns the set of all types
obtained by replacing variables by the associated abstract identifier
in each argument of a recursive call. The last set intuitively
represents all the possible ways in which recursive calls can shuffle and
recompose the subtrees forming the initial input type. The second
pass of the analysis first abstractly reduces by one step each
recursive filter by applying it on the set of types collected in the first
pass and then checks whether, after this reduction,
all the variables marked in the first pass (ie, those that occur in
arguments of recursive calls) are still bound to subtrees of the initial
input type: if this check fails, then the filter is rejected.

It is not difficult to see that the type inference algorithm
converges if and only if for every input type there exists an integer $n$
such that after $n$ recursive calls the marked variables are bound
only to subtrees of the initial input type (or to something that does
not depend on it, of course). Since deciding whether such an $n$
exists is not possible, our analysis checks whether for all possible
input types a filter satisfies it for $n = 1$, that is to say, that at every
recursive call its marked variables satisfy the property; otherwise it
rejects the filter.

Theorem 14 (Termination). If $\mathit{Check}(f)$, then for every type $t$ the
deduction of $\Gamma\,;\varnothing\,;\varnothing \vdash_{\mathit{fil}} f(t) : s$ is in 2-EXPTIME. Furthermore,
if $t$ is given as a non-deterministic tree automaton (NTA) then
$\Gamma\,;\varnothing\,;\varnothing \vdash_{\mathit{fil}} f(t) : s$ is in EXPTIME, where the size of the problem
is $|f| \times |t|$.

(For proofs see Appendix A for termination and Appendix D for
complexity.) This complexity result is in line with those of similar
formalisms. For instance in [20], it is shown that type-checking
non-deterministic top-down tree transducers is in EXPTIME when the
input and output types are given by an NTA.

All filters defined in this paper (excepted those in Appendix |B]l 
pass the analysis. As an example consider the filter rotate that ap- 
plied to a list returns the same list with the first element moved to 
the last position (and the empty list if applied to the empty list): 

HX. ( {x,{y,z)) ^ {y,X{x,z)) \ w =^ w ) 
The analysis succeeds on this filter. If we denote by ι_x the abstract 
subtree bound to the variable x, then the recursive call will be 
executed on the abstract argument (ι_x, ι_z). So in the unfolding of the 
recursive call x is bound to ι_x, whereas y and z are bound to two 
distinct subtrees of ι_z. The variables in the recursive call, x and z, 
are thus bound to subtrees of the original tree (even though the 
argument of the recursive call is not a subtree of the original tree), 
therefore the filter is accepted. In order to appreciate the precision 
of the inference algorithm consider the type [int+ bool+], that is, 
the type of lists formed by some integers (at least one) followed by 
some booleans (at least one). For the application of rotate to an 
argument of this type our algorithm statically infers the most 
precise type, that is, [int* bool+ int]. If we apply it once more the 
inferred type is [int* bool+ int int] | [bool* int bool]. 
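
To make the behaviour of rotate concrete, here is a minimal Python sketch of its evaluation (not of the type inference). It assumes the Lisp-like encoding of sequences as nested pairs ending in 'nil, modelled with Python tuples and the string "nil"; the helper names to_seq and from_seq are illustrative and not part of the calculus.

```python
NIL = "nil"  # stands for the atom 'nil


def to_seq(items):
    """Encode a Python list as nested pairs: [1, 2] becomes (1, (2, NIL))."""
    seq = NIL
    for item in reversed(items):
        seq = (item, seq)
    return seq


def from_seq(seq):
    """Decode nested pairs back into a Python list."""
    out = []
    while seq != NIL:
        head, seq = seq
        out.append(head)
    return out


def rotate(v):
    """muX. ( (x,(y,z)) => (y, X(x,z)) | w => w )"""
    if isinstance(v, tuple) and isinstance(v[1], tuple):
        x, (y, z) = v
        # The recursive call receives (x, z): a new pair built from two
        # subtrees of the input, which is not itself a subtree of the input.
        return (y, rotate((x, z)))
    return v  # second branch: w => w
```

For example, from_seq(rotate(to_seq([1, 2, True, False]))) evaluates to [2, True, False, 1], a value of type [int* bool+ int].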

Generic filters are Turing complete. However, requiring that 
Check() holds (meaning that the filter is typeable by our system) 
restricts the expressive power of our filters by preventing them 
from recomposing a new value before doing a recursive call. For 
instance, it is not possible to typecheck a filter which reverses the 
elements of a sequence. Determining the exact class of 
transformations that typeable filters can express is challenging. 
However it is possible to show (cf. the appendix) that typeable 
filters are strictly more expressive than top-down tree transducers 
with regular look-ahead, a formalism for tree transformations 
introduced in [14]. The intuition about this result can be conveyed 
by an example. Consider the tree: 

    $a(u_1(\ldots(u_n()))\ v_1(\ldots(v_m())))$ 

that is, a tree whose root is labeled a with two children, each being 
a monadic tree of height n and m, respectively. Then it is not 
possible to write a top-down tree transducer with regular look-ahead 
that creates the tree 

    $a(u_1(\ldots(u_n(v_1(\ldots(v_m())))))))$ 

which is just the concatenation of the two children of the root, seen 
as sequences, a transformation that can be easily programmed by 
typeable filters. The key difference in expressive power comes from 
the fact that filters are evaluated with an environment that binds 
capture variables to sub-trees of the input. This feature is essential 
to encode sequence concatenation and sequence flattening — two 
pervasive operations when dealing with sequences — that cannot be 
expressed by top-down tree transducers with regular look-ahead. 
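
The role of capture variables can be illustrated on concatenation. The Python sketch below mirrors one possible filter for it, μX. ('nil, ys) ⇒ ys | ((x, xs), ys) ⇒ (x, X(xs, ys)); this particular formulation is an assumption made for illustration, not a definition taken from the text. Sequences are modelled as nested tuples ending in the string "nil".

```python
NIL = "nil"  # stands for the atom 'nil


def concat(pair):
    """Concatenate two sequences encoded as nested pairs ending in NIL."""
    left, right = pair
    if left == NIL:
        return right  # ('nil, ys) => ys
    x, xs = left
    # The argument of the recursive call, (xs, right), recombines two
    # subtrees of the input that were bound by capture variables.
    return (x, concat((xs, right)))
```

Every recursive call is made on a pair of subtrees of the original input, which is the shape of recursion the termination analysis accepts.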



5. Jaql 

In this Section, we show how filters can be used to capture some 
popular languages for processing data on the Cloud. We consider 
Jaql lITstl , a query language for JSON developed by IBM. We give 
translation rules from a subset of Jaql into filters. 

Definition 15 (Jaql expressions). We use the following simplified 
grammar for Jaql (where we distinguish simple expressions, ranged 
over by e, from "core expressions" ranged over by k). 



    e ::= c                                            (constants) 
        | x                                            (variables) 
        | $                                            (current value) 
        | [e, ..., e]                                  (arrays) 
        | { e:e, ..., e:e }                            (records) 
        | e.l                                          (field access) 
        | op(e, ..., e)                                (function call) 
        | e -> k                                       (pipe) 

    k ::= filter (each x)? e                           (filter) 
        | transform (each x)? e                        (transform) 
        | expand ((each x)? e)?                        (expand) 
        | group ((each x)? by x = e (as x)?)? into e   (grouping) 



5.1 Built-in filters 

In order to ease the presentation we extend our syntax by adding 
"filter definitions" (already informally used in the introduction) to 
expressions and "filter calls" to filters: 

    e ::= let filter F[F1, ..., Fn] = f in e    (filter defn.) 
    f ::= F[f, ..., f]                          (call) 

where F ranges over filter names. The mapping for most of the 
language we consider relies on the following built-in filters. 

    let filter Filter[F] = μX. 
          'nil ⇒ 'nil 
        | ((x, xs), tl) ⇒ (X(x, xs), X(tl)) 
        | (x, tl) ⇒ Fx ; ('true ⇒ (x, X(tl)) | 'false ⇒ X(tl)) 

    let filter Transform[F] = μX. 
          'nil ⇒ 'nil 
        | ((x, xs), tl) ⇒ (X(x, xs), X(tl)) 
        | (x, tl) ⇒ (Fx, X(tl)) 

    let filter Expand = μX. 
          'nil ⇒ 'nil 
        | ('nil, tl) ⇒ X(tl) 
        | ((x, xs), tl) ⇒ (x, X(xs, tl)) 
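
The dynamic behaviour of these three filters can be mimicked by the following Python sketch. It is only an illustration under stated assumptions: sequences are nested tuples ending in the string "nil", the parameter F is an ordinary Python function, and the helpers to_seq and from_seq are hypothetical names introduced here. As in the definitions, an element that is itself a pair is treated as a nested sequence (second branch).

```python
NIL = "nil"  # stands for the atom 'nil


def to_seq(items):
    """Encode a (possibly nested) Python list as pairs ending in NIL."""
    seq = NIL
    for item in reversed(items):
        if isinstance(item, list):
            item = to_seq(item)
        seq = (item, seq)
    return seq


def from_seq(seq):
    """Decode the top level of a pair-encoded sequence into a Python list."""
    out = []
    while seq != NIL:
        head, seq = seq
        out.append(head)
    return out


def Filter(F):
    def X(v):
        if v == NIL:
            return NIL
        head, tl = v
        if isinstance(head, tuple):        # ((x, xs), tl): recurse into both
            return (X(head), X(tl))
        return (head, X(tl)) if F(head) else X(tl)
    return X


def Transform(F):
    def X(v):
        if v == NIL:
            return NIL
        head, tl = v
        if isinstance(head, tuple):        # ((x, xs), tl): recurse into both
            return (X(head), X(tl))
        return (F(head), X(tl))
    return X


def Expand(v):
    if v == NIL:
        return NIL
    head, tl = v
    if head == NIL:                        # ('nil, tl): skip an empty sublist
        return Expand(tl)
    x, xs = head                           # ((x, xs), tl): emit x, keep going
    return (x, Expand((xs, tl)))
```

For instance, from_seq(Expand(to_seq([[1, 2], [], [3]]))) gives [1, 2, 3]. Applying Expand to a sequence whose elements are not sequences raises a Python error, which loosely corresponds to no branch of the filter matching.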



5.2 Mapping 

Jaql expressions are mapped to our expressions as follows (where 
$ is a distinguished expression variable interpreting Jaql's $): 

    ⟦c⟧ = c 
    ⟦x⟧ = x 
    ⟦$⟧ = $ 
    ⟦{e1:e1', ..., en:en'}⟧ = { ⟦e1⟧:⟦e1'⟧, ..., ⟦en⟧:⟦en'⟧ } 
    ⟦e.l⟧ = ⟦e⟧.l 
    ⟦op(e1, ..., en)⟧ = op(⟦e1⟧, ..., ⟦en⟧) 
    ⟦[e1, ..., en]⟧ = (⟦e1⟧, ...(⟦en⟧, 'nil)...) 
    ⟦e -> k⟧ = ⟦e⟧ ; ⟦k⟧_F 

Jaql core expressions are mapped to filters as follows: 

    ⟦filter e⟧_F = ⟦filter each $ e⟧_F 
    ⟦filter each x e⟧_F = Filter[x ⇒ ⟦e⟧] 
    ⟦transform e⟧_F = ⟦transform each $ e⟧_F 
    ⟦transform each x e⟧_F = Transform[x ⇒ ⟦e⟧] 
    ⟦expand each x e⟧_F = ⟦transform each x e⟧_F ; Expand 
    ⟦expand⟧_F = Expand 
    ⟦group into e⟧_F = ⟦group by y=true into e⟧_F 
    ⟦group by y=e1 into e2⟧_F = ⟦group each $ by y=e1 into e2⟧_F 
    ⟦group each x by y=e1 into e2⟧_F = 
        ⟦group each x by y=e1 as $ into e2⟧_F 
    ⟦group each x by y=e1 as g into e2⟧_F = 
        groupby x ⇒ ⟦e1⟧ ; Transform[(y, g) ⇒ ⟦e2⟧] 

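
The way omitted clauses are filled in with defaults can be sketched as a small Python function that turns a core expression into a filter term, written as a string. This is only an illustration: the dictionary keys ("kind", "each", "body", "key_var", "key", "as", "into") are hypothetical names chosen here, sub-expressions are assumed to be already translated, and the case of expand with a body assumes that it amounts to a transform followed by Expand, as Jaql's semantics suggests.

```python
def core_to_filter(k):
    """Translate a core-expression dict into a filter term (as a string)."""
    kind = k["kind"]
    x = k.get("each", "$")  # an omitted "each x" defaults to the current value $
    if kind == "filter":
        return f"Filter[{x} => {k['body']}]"
    if kind == "transform":
        return f"Transform[{x} => {k['body']}]"
    if kind == "expand":
        if "body" not in k:
            return "Expand"
        # Assumption: expand each x e = transform each x e, then Expand.
        return f"Transform[{x} => {k['body']}] ; Expand"
    if kind == "group":
        y = k.get("key_var", "y")   # name of the key variable
        e1 = k.get("key", "true")   # an omitted "by" clause groups by true
        g = k.get("as", "$")        # an omitted "as" clause binds the group to $
        return f"groupby {x} => {e1} ; Transform[({y},{g}) => {k['into']}]"
    raise ValueError(f"unknown core expression: {kind}")
```

For example, core_to_filter({"kind": "group", "into": "count($)"}) yields the string groupby $ => true ; Transform[(y,$) => count($)].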
This translation defines the (first, to our knowledge) formal 
semantics of Jaql. Such a translation is all that is needed to define 
the semantics of a NoSQL language and, as a bonus, endow it with the 
type inference system we described without requiring any 
modification of the original language. No further action is demanded 
since the machinery to exploit it is all developed in this work. 

As for typing, every Jaql expression is encoded into a filter for 
which type-checking is ensured to terminate: Check() holds for 
Filter[], Transform[], and Expand (provided it holds also for 
their arguments) since they only perform recursive calls on 
recombinations of subtrees of their input; by its definition, the 
encoding does not introduce any new recursion and, hence, it always 
yields a composition and application of filters for which Check() holds. 

5.3 Examples 

To show how we use the encoding, let us encode the example of 
the introduction. For the sake of concision we will use filter 
definitions (rather than expanding them in detail). We use Fil 
and Sel defined in the introduction, Expand and Transform[] 
defined at the beginning of the section, the encoding of Jaql's field 
selection as defined in Section 3.3, and finally Head, which returns 
the first element of a sequence, and a family of recursive filters 
Rgrpi with i ∈ ℕ⁺, both defined below: 

    let filter Head = 'nil => null | (x,xs) => x 

    let filter Rgrpi = 'nil => 'nil 
                     | ((i,x),tail) => (x, Rgrpi tail) 
                     | (_,tail) => Rgrpi tail 

Then, the query in the introduction is encoded as follows 

    [employees depts]; 
    [Sel Fil]; 
    [Transform[x => (1,x)] Transform[x => (2,x)]]; 
    Expand; 
    groupby ( (1,$) => $.dept | (2,$) => $.depid ); 
    Transform[(g,l) => ( 
        [(l; Rgrp1) (l; Rgrp2)]; 
        [es ds] => 
          { dept: g, 
            deptName: (ds ; Head).name, 
            numEmps: count(es) } )] 

In words, we perform the selection on employees and filter the 
departments (lines 1-2); we tag each element by 1 if it comes from 
employees, and by 2 if it comes from departments (line 3); we 
merge the two collections (line 4); we group the heterogeneous list 
according to the corresponding key (line 5); then for each element 
of the result of grouping we capture in g the key (line 6), split the 
group into employees and depts (line 7), capture each subgroup into 
the corresponding variable (ie, es and ds) (line 8) and return the 
expression specified in the query after the "into" (lines 8-10). The 
general definition of the encoding for the co-grouping is given in 
Appendix H. 
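
The steps described above can be traced on concrete data with the following Python sketch. It is not the filter encoding itself: plain Python lists and dictionaries stand in for sequences and records, the selection and filtering steps (Sel and Fil, defined in the introduction) are assumed to have been applied already, and the function names rgrp, head and cogroup are hypothetical.

```python
def rgrp(i, tagged):
    """Keep the payloads of the elements tagged with i (the role of Rgrpi)."""
    return [x for (tag, x) in tagged if tag == i]


def head(seq):
    """First element of a sequence, or None (null) if it is empty."""
    return seq[0] if seq else None


def cogroup(employees, depts):
    # Tag each element with its origin and merge the two collections.
    tagged = [(1, e) for e in employees] + [(2, d) for d in depts]
    # Group the heterogeneous list; the key depends on the tag.
    groups = {}
    for tag, rec in tagged:
        key = rec["dept"] if tag == 1 else rec["depid"]
        groups.setdefault(key, []).append((tag, rec))
    # For each (key, group) pair, split the group and build the result.
    result = []
    for g, l in groups.items():
        es, ds = rgrp(1, l), rgrp(2, l)
        first = head(ds)
        result.append({
            "dept": g,
            "deptName": first["name"] if first is not None else None,
            "numEmps": len(es),
        })
    return result
```

With two employees in department 1, one employee in department 3, and a single department record with depid 1, the result contains a record for department 3 whose deptName is None, since Head is applied to an empty list of departments.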



Let us now illustrate how the above composition of filters is 
typed. Consider an instance where: 

• employees has type [ Remp* ], where 
  Remp = { dept: int, income: int, .. } 

• depts has type [ (Rdep | Rbranch)* ], where 
  Rdep = { depid: int, name: string, size: int } 
  Rbranch = { brid: int, name: string } 
  (this type is a subtype of Dept as defined in the introduction) 

The global input type is therefore (line 1) 

    [ [ Remp* ] [ (Rdep | Rbranch)* ] ] 

which becomes, after selection and filtering (line 2) 

    [ [ Remp* ] [ Rdep* ] ] 

(note how all occurrences of Rbranch are ignored by Fil). Tagging 
with an integer (line 3) and flattening (line 4) yields 

    [ (1,Remp)* (2,Rdep)* ] 

which illustrates the precise typing of products coupled with 
singleton types (ie, 1 instead of int). While the groupby (line 5) 
introduces an approximation, the dependency between the tag and the 
corresponding type is kept 

    [ (int, [ ((1,Remp) | (2,Rdep))+ ])* ] 

Lastly the transform is typed exactly, yielding the final type 

    [ { dept: int, deptName: string | null, numEmps: int }* ] 

Note how null is retained in the output type (since there may be 
employees without a department, Head may be applied to an 
empty list, returning null, and the selection of name on null 
returns null). For instance, suppose we pipe the Jaql grouping defined 
in the introduction into the following Jaql expression, in order to 
produce a printable representation of the records of the result 

    transform each x ( 
        (x.deptName) @ ":" @ (to_string x.dep) @ ":" @ (x.numEmps) ) 

where @ denotes string concatenation and to_string is a conversion 
operator (from any type to string). The composition is ill-typed 
for three reasons: the field dept is misspelled as dep, x.numEmps 
is of type int (so it must be applied to to_string before 
concatenation), and the programmer did not account for the fact that 
the value stored in the field deptName may be null. The encoding 
produces the following lines to be appended to the previous code: 

    Transform[ x => 
        (x.deptName) @ ":" @ (to_string x.dep) @ ":" @ (x.numEmps) ] 

in which all three errors are detected by our type system. A 
subtler example of error is given by the following alternative code 

    Transform[ 
        { dept: d, deptName: n&String, numEmps: e } => 
            n @ ":" @ (to_string d) @ ":" @ (to_string e) 
      | { deptName: null, .. } => "" 
      | _ => "Invalid department" ] 

which corrects all the previous errors but adds a new one since, as 
detected by our type system, the last branch can never be selected. 
As we can see, our type-system ensures soundness, forcing the 
programmer to handle exceptional situations (as in the null example 
above) but is also precise enough to detect that some code paths 
can never be reached. 

In order to focus on our contributions we kept the language of 
types and filters simple. However there already exist several 
contributions on the types and expressions used here. Two in particular 
are worth mentioning in this context: recursive patterns and XML. 

Definition 3 defines patterns inductively but, alternatively, we 
can consider the (possibly infinite) regular trees coinductively 
generated by these productions and, along the lines of what is done in 
CDuce, use the recursive patterns so obtained to encode regular 
expression patterns (see |0I). Although this does not enhance 
expressiveness, it greatly improves the writing of programs since it 
makes it possible to capture distinct subsequences of a sequence by 
a single match. For instance, when a sequence is matched against 
a pattern such as [ (int as x | bool as y | _)* ], then x 
captures (the list of) all integer elements (capture variables in 
regular expression patterns are bound to lists), y captures all Boolean 
elements, while the remaining elements are ignored. By such 
patterns, co-grouping can be encoded without the Rgrp. For instance, 
the transform in lines 6-11 can be more compactly rendered as: 

    Transform[(g, [ ((1,es) | (2,ds))* ]) => 
        { dept: g, 
          deptName: (ds; Head).name, 
          numEmps: count(es) }] 
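
The capture behaviour of a pattern such as [ (int as x | bool as y | _)* ] can be pictured with the short Python sketch below. It only mimics the binding of capture variables to lists on a Python list; the function name match_int_bool is hypothetical, and no typing is performed.

```python
def match_int_bool(seq):
    """Bind x to all integers and y to all Booleans of seq, in order."""
    xs, ys = [], []
    for v in seq:
        # bool is tested first because Python's bool is a subclass of int.
        if isinstance(v, bool):
            ys.append(v)
        elif isinstance(v, int):
            xs.append(v)
        # any other element matches the wildcard _ and is ignored
    return {"x": xs, "y": ys}
```

A single traversal thus yields both subsequences, which is what makes the Rgrp filters unnecessary in the compact version of the transform.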

As far as XML is concerned, the types used here were originally 
defined for XML, so it comes as no surprise that they can seamlessly 
express XML types and values. For example CDuce uses the very 
same types used here to encode both XML types and elements as 
triples, the first element being the tag, the second a record 
representing attributes, and the third a heterogeneous sequence for the 
content of the element. Furthermore, we can adapt the results of [23] 
to encode forward XPath queries in filters. Therefore, it requires 
little effort to use the filters presented here to encode languages 
such as JSONiq [12], designed to integrate JSON and XML, or to 
precisely type regular expressions, the import/export of XML data, 
or XPath queries embedded in Jaql programs. This is shown in the 
section that follows. 



6. JSON, XML, Regex 

There exist various attempts to integrate JSON and XML. For 
instance JSONiq [12] is a query language designed to allow XML 
and JSON to be used in the same query. The motivation is that 
JSON and XML are both widely used for data interchange on the 
Internet. In many applications, JSON is replacing XML in Web 
Service APIs and data feeds, while more and more applications 
support both formats. More precisely, JSONiq embeds JSON into 
an XML query language (XQuery), but it does it in a stratified way: 
JSONiq does not allow XML nodes to contain JSON objects and 
arrays. The result is thus similar to OCamlDuce, the embedding of 
CDuce's XML types and expressions into OCaml, with the same 
drawbacks. 

Our type system is derived from the type system of CDuce, 
whereas the theory of filters was originally designed to use CDuce 
as a host language. As a consequence XML types and expressions 
can be seamlessly integrated in the work presented here, without 
any particular restriction. To that end it suffices to use for XML 
elements and types the same encoding used in the implementation 
of CDuce, where an XML element is just a triple formed by a tag 
(here, an expression), a record (whose labels are the attributes of 
the element), and a sequence (of characters and/or other XML 
elements) denoting its content. So for instance the following element 

    <product system="US-size"> 
      <number>557</number> 
      <name>Blouse</name> 
    </product> 

is encoded by the following triple: 

    ("product", { system: "US-size" }, 
      [ 
        ("number", {}, [ 557 ]) 
        ("name", {}, "Blouse") 
      ] 
    ) 



and this latter, with the syntactic sugar defined for CDuce, can be 
written as: 

<product system="US-size"> [ 
<number> [ 557 ] 
<name> [ Blouse ] 

] 
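
The triple encoding can be sketched in Python with the standard xml.etree.ElementTree module. The function name encode is hypothetical and, as a simplifying assumption, text content is kept as Python strings (rather than as sequences of characters or parsed numbers).

```python
import xml.etree.ElementTree as ET


def encode(elem):
    """Encode an XML element as a (tag, attributes, content) triple."""
    content = []
    if elem.text and elem.text.strip():
        content.append(elem.text.strip())
    for child in elem:
        content.append(encode(child))
        # Text that follows a child element belongs to the parent's content.
        if child.tail and child.tail.strip():
            content.append(child.tail.strip())
    return (elem.tag, dict(elem.attrib), content)


doc = ET.fromstring(
    '<product system="US-size">'
    "<number>557</number><name>Blouse</name>"
    "</product>"
)
triple = encode(doc)
```

Here triple has the tag "product" as first component, the record {"system": "US-size"} as second component, and a sequence of two nested triples as content.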

Clearly in our system there are no restrictions in merging and 
nesting JSON and XML and no further extension is required to our 
system to define XML query and processing expressions. Only the 
introduction of some syntactic sugar, to make the expressions 
readable, seems helpful: 

    e ::= <e e>e 
    f ::= <f f>f 

The system we introduced here is already able to reproduce (and 
type) the same transformations as in JSONiq, but without the 
restrictions and drawbacks of the latter (this is why we argue that it 
is better to extend NoSQL languages with XML primitives directly 
derived from our system rather than to use our system to encode 
JSONiq). For instance, the example given in the JSONiq draft to 
show how to render in XHTML the following JSON data: 

    { 
      "col labels" : ["singular", "plural"], 
      "row labels" : ["1p", "2p", "3p"], 
      "data" : 
        [ 
          ["spinne", "spinnen"], 
          ["spinnst", "spinnt"], 
          ["spinnt", "spinnen"] 
        ] 
    } 


can be encoded in the filters presented in this work (with the new 
syntactic sugar) as: 

    { "col labels" : cl, 
      "row labels" : rl, 
      "data" : dl 
    } => 
    <table border="1" cellpadding="1" cellspacing="2">[ 
      <tr>[ <th>[ ] !(cl; Transform[ x -> <th>x ]) ] 
      !(rl; Transform[ h -> 
          <tr>[ <th>h !(dl; Transform[ x -> <td>x ]) ] 
        ] 
      ) 
    ] 


(where ! expands a subsequence in the containing sequence). The 
resulting Xhtml document is rendered in a web browser as: 











            singular    plural 
    1p      spinne      spinnen 
    2p      spinnst     spinnt 
    3p      spinnt      spinnen 
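
For comparison, the same rendering can be sketched in plain Python. This is an independent illustration and not a translation of the filter above: it assumes that each row label is paired with the data row at the same position, and the function name render_table is hypothetical.

```python
from html import escape


def render_table(table):
    """Render the JSON table as an XHTML <table> string."""
    header = "<tr><th></th>" + "".join(
        f"<th>{escape(c)}</th>" for c in table["col labels"]
    ) + "</tr>"
    rows = []
    for label, row in zip(table["row labels"], table["data"]):
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        rows.append(f"<tr><th>{escape(label)}</th>{cells}</tr>")
    return (
        '<table border="1" cellpadding="1" cellspacing="2">'
        + header + "".join(rows) + "</table>"
    )


data = {
    "col labels": ["singular", "plural"],
    "row labels": ["1p", "2p", "3p"],
    "data": [["spinne", "spinnen"], ["spinnst", "spinnt"], ["spinnt", "spinnen"]],
}
html_table = render_table(data)
```

Unlike the filter version, nothing here is statically checked: a missing key or a non-string cell would only be detected at run time.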



Similarly, Jaql built-in libraries include functions to convert and 
manipulate XML data. So, for example, just as it is possible in Jaql 
to embed SQL queries, it is also possible to evaluate XPath 
expressions, by the function xpath(), which takes two arguments, an 
XML document and a string containing an XPath expression, eg, 
xpath(read(seq("conf/addrs.xml")), "content/city"). 
Filters can encode forward XPath expressions (see [23]) and 
precisely type them. So while in the current implementation there 
is no check of the type of the result of an external query (neither 
for XPath nor for SQL) and the XML document is produced 
independently from Jaql, by encoding (forward) XPath into filters we can 
not only precisely type calls to Jaql's xpath() function but also 
feed them with documents produced by Jaql expressions. 

Finally, the very same regular expression types that are used 
to describe heterogeneous sequences and, in particular, the 
content of XML elements, can be used to type regular expressions. 
Functions working on regular expressions (regex for short) form, 
in practice, yet another domain specific language that is embedded 
in general purpose languages in an untyped or weakly typed 
(typically, every result is of type string) way. Recursive patterns can 
straightforwardly encode regex matching. Therefore, by 
combining the pattern filter with other filters it is possible to encode any 
regex library, with the important advantage that, as stated by 
Theorem 4, the set of values (respectively, strings) accepted by a pattern 
(respectively, by a regular expression) can be precisely computed 
and can be expressed by a type. So a function such as Jaql's 
built-in function regex_extract(), which extracts the substrings 
that match a given regex, can be easily implemented by filters and 
precisely typed by our typing rules. Typing will then amount to 
intersecting the type of the string (which can be more precise than 
just string) with the type accepted by the pattern that encodes the 
regex at issue. 

7. Programming with filters 

Up to now we have used the filters to encode operators 
hard-coded in some languages in order to type them. Of course, it is 
possible to embed the typing technology we introduced directly 
into the compilers of these languages so as to obtain the flexible 
typing that characterizes our system. However, an important aspect 
of filters we have ignored so far is that they can be used directly by 
the programmer to define user-defined operators that are typed as 
precisely as the hard-coded ones. Therefore one possibility is to 
extend existing NoSQL languages by adding to their expressions the 
filter application expression f e. 

The next problem is to decide how far to go in the definition 
of filters. A complete integration, that is, taking for f all the 
definitions given so far, is conceivable but might disrupt the execution 
model of the host language, since the user could then define 
complex iterators that do not fit map-reduce or the chosen distributed 
compilation policy. A good compromise could be to add to the host 
language only filters which have "local" effects, without 
affecting the map-reduce or distributed compilation execution model. 
The minimal solution consists in choosing just the filters for 
patterns, unions, and expressions: 

    f ::= e | p => f | f|f 

Adding such filters to Jaql (we use the "=>" arrow for patterns in 
order to avoid confusion with Jaql's "->" pipe operator) would not 
allow the user to define powerful operators, but their use would 
already dramatically improve type precision. For instance we could 
define the following Jaql expression 

    transform ( {a: x, ..} as y => {y.*, sum: x+x} | y => y ) 

(with the convention that a filter occurring as an expression 
denotes its application to the current argument $). With this syntax, 
our inference system is able to deduce that feeding this expression 
with an argument of type [{a?:int, c:bool}*] returns a result 
of type [({a:int, c:bool, sum:int} | {c:bool})*]. This 
precision comes from the capacity of our inference system to 
discriminate between the two branches of the filter and deduce that 
a sum field will be added only if the a field is present. Similarly, 
by using pattern matching in a Jaql "filter" expression, we can 
deduce that filter ( int => true | _ => false ) fed with any 
sequence of elements always returns a (possibly empty) list of 
integers. An even greater precision can be obtained for grouping 
expressions when the generation of the key is performed by a filter 
that discriminates on types: the result type can keep a precise 
correspondence between keys and the corresponding groups. As an 
example consider the following (extended) Jaql grouping 
expression: 

    group e by ({town: ("Roma" | "Pisa"), ..} => "Italia" 
              | {town: "Paris", ..} => "France" 
              | _ => "?") 

If e has type [{town: string, addr: string}*], then the 
type inferred by our system for this groupby expression is 

    [( ("Italia", [{town: "Roma" | "Pisa", addr: string}+]) 
     | ("France", [{town: "Paris", addr: string}+]) 
     | ("?", [{town: string\("Roma" | "Pisa" | "Paris"), 
               addr: string}+]) 
     )*] 

which precisely associates every key to the type of the elements it 
groups. 
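
The run-time behaviour of the two examples of this section (the branching transform and the grouping with a discriminating key) can be sketched in Python as follows. Only evaluation is mimicked, on Python dictionaries and lists; the precise types discussed above are exactly what such untyped code does not provide. The function names add_sum, country and group_by_key are hypothetical.

```python
def add_sum(record):
    """{a: x, ..} as y => {y.*, sum: x+x}  |  y => y"""
    if "a" in record:
        return {**record, "sum": record["a"] + record["a"]}
    return record  # second branch: records without an a field are unchanged


def country(record):
    """Key-generating filter that discriminates on the town field."""
    town = record.get("town")
    if town in ("Roma", "Pisa"):
        return "Italia"
    if town == "Paris":
        return "France"
    return "?"


def group_by_key(records, key):
    """Group records by key(record); returns a list of (key, group) pairs."""
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r)
    return list(groups.items())
```

For instance, [add_sum(r) for r in [{"a": 2, "c": True}, {"c": False}]] adds a sum field to the first record only, and group_by_key(towns, country) puts every record whose town is neither "Roma", "Pisa" nor "Paris" in the group with key "?".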

Finally, in order to allow a modular usage of filters, adding just 
filter application to the expressions of the foreign language does not 
suffice: parametric filter definitions are also needed. 

    e ::= f e | let filter F[F1, ..., Fn] = f in e 

However, along the lines of what we already said about the disruption 
of the execution model, recursive parametric filter definitions should 
probably be disallowed, since a compilation according to a 
map-reduce model would require disentangling recursive calls. 

8. Commentaries 

Finally, let us explain some subtler design choices for our system. 

Filter design: The reader may wonder whether products and 
record filters are really necessary since, at first sight, the filter 
(f1,f2) could be encoded as (x,y) ⇒ (f1 x, f2 y) and similarly 
for records. The point is that f1 x and f2 y are expressions (and 
thus their pair is a filter) only if the fi's are closed (ie, without 
free term recursion variables). Without an explicit product filter it 
would not be possible to program a filter as simple as the identity 
map, μX. 'nil ⇒ 'nil | (h,t) ⇒ (h, Xt), since Xt is not an 
expression (X is a free term recursion variable).¹ Similarly, we need 
an explicit record filter to process recursively defined record types 
such as μX.({head:int, tail:X} | 'nil). 

¹ Syntactically we could write μX. 'nil ⇒ 'nil | (h,t) ⇒ (h, (Xt)v) 
where v is any value, but then this would not pass type-checking since the 
expression (Xt)v must be typeable without knowing the Δ environment 
(cf rule [Filter-App] at the beginning of Section 4.3). We purposely 
stratified the system in order to avoid mutual recursion between filters and 
expressions. 

Likewise, one can wonder why we put in filters only the "open" 
record variant that copies extra fields and not the closed one. The 
reason is that if we want a filter to be applied only to records with 
exactly the fields specified in the filter, then this can be simply 
obtained by pattern matching. So the filter {ℓ1:f1, ..., ℓn:fn} 
(ie, without the trailing "..") can be simply introduced as syntactic 
sugar for {ℓ1:any, ..., ℓn:any} ⇒ {ℓ1:f1, ..., ℓn:fn, ..}. 

Constructors: The syntax for constructing records and pairs is 
exactly the same in patterns, types, expressions, and filters. The 
reader may wonder why we did not distinguish them by using, say, 
× for product types or = instead of : in record values. This, 
combined with the fact that values and singletons have the same syntax, 
is a critical design choice that greatly reduces the confusion in these 
languages, since it makes it possible to have a unique representation 
for constructions that are semantically equivalent. Consider for 
instance the pattern (x, (3, 'nil)). With our syntax (3, 'nil) denotes 
either the product type of two singletons 3 and 'nil, or the value 
(3, 'nil), or the singleton that contains this value. According to 
the interpretation we choose, the pattern can then be interpreted as 
a pattern that matches a product or a pattern that matches a value. 
If we had differentiated the syntax of singletons from that of values 
(eg, {v}) and that of pairs from products, then the pattern above 
could have been written in five different ways. The point is that 
they all would match exactly the same sets of values, which is why 
we chose to have the same syntax for all of them. 

Record types: The definition of records is redundant (both for 
types and patterns). Instead of the current definition we could 
have used just {ℓ:t}, {}, and {..}, since the rest can be encoded 
by intersections. For instance, {ℓ1:t1, ..., ℓn:tn, ..} = 
{ℓ1:t1}&...&{ℓn:tn}&{..}. We opted to use the redundant definition 
for the sake of clarity. In order to type records with computed 
labels we distinguished two cases according to whether the type of 
a record label is finite or not. Although such a distinction is simple, 
it is not unrealistic. Labels with singleton types cover the (most 
common) case of records with statically fixed labels. The dynamic 
choice of a label from a statically known list of labels is a usage 
pattern seen in JavaScript when building an object which must con- 
form to some interface based on a run-time value. Labels with infi- 
nite types cover the fairly common usage scenario in which records 
are used as dictionaries: we deduce for the expression computing 
the label the type string, thus forcing the programmer to insert 
some code that checks that the label is present before accessing it. 

The rationale behind the typing of records was twofold. First 
and foremost, in this work we wanted to avoid type annotations 
at all costs (since there is not even a notion of schema for JSON 
records and collections — only the notion of basic type is defined — 
we cannot expect the Jaql programmer to put any kind of type 
information in the code). More sophisticated type systems, such 
as dependent types, would probably preclude type reconstruction: 
dependent types need a lot of annotations and this does not fit our 
requirements. Second, we wanted the type-system to be simple 
yet precise. Making the finite/infinite distinction increases typing 
precision at no cost (we do not need any extra machinery since 
we already have singleton types). Adding heuristics or complex 
analysis just to gain some precision on records would have blurred 
the main focus of our paper, which is not on typing records but 
on typing transformations on records. We leave such additions for 
future work. 

Record polymorphism: The type-oriented reader will have no- 
ticed that we do not use row variables to type records, and nev- 
ertheless we have a high degree of polymorphism. Row variables 
are useful to type functions or transformations since they can keep 
track of record fields that are not modified by the transformation. In 
this setting we do not need them since we do not type transforma- 
tions (ie, filters) but just the application of transformations (filters 
are not first-class terms). We have polymorphic typing via filters 
(see how the first example given in Section 7 keeps track of the c 
field) and therefore open records suffice. 

Record selection: Some languages (typically, the dynamic ones 
such as JavaScript, Ruby, Python) allow the label of a field 
selection to be computed by an expression. We considered the definition 
of a fine-grained rule to type expressions of the form e1.e2: 
whenever e2 is typed by a finite union of strings, the rule would give 
a finite approximation of the type of the selection. However, such 
an extension would complicate the definition of the type system, just 
to handle a few interesting cases in which a finite union type can be 
deduced. Therefore, we preferred to omit its study and leave it for 
future work. 



9. Related Work 

In the (nested) relational (and SQL) context, many works have 
studied the integration of (nested) relational algebra or SQL into 
general purpose programming languages. Among the first attempts 
was the integration of the relational model in Pascal [31] or in 
Smalltalk fTlll . Also, monads or comprehensions [7, 33, 34] have 
been successfully used to design and implement query languages 
including a way to embed queries within host languages. 
Significant efforts have been made to equip those languages with type 
systems and type checking disciplines d, |2^, [2ql and more 
recently [27] for integration and typing aspects. However, these 
approaches only support homogeneous sequences of records in the 
context of specific classes of queries (practically equivalent to 
a nested relational algebra or calculus), they do not account for 
records with computable labels, and therefore they are not easily 
transposable to a setting where sequences are heterogeneous, data 
are semi-structured, and queries are much more expressive. 

While the present work is inspired by and stems from previous 
works on the XML iterators, targeting NoSQL languages made the 
filter calculus presented here substantially different from the one 
of [9, 23] (dubbed XML filters in what follows), in syntax as well 
as in dynamic and static semantics. In [9] XML filters behave as 
some kind of top-down tree transducers, termination is enforced by 
heavy syntactic restrictions, and a less constrained use of 
composition makes type inference challenging and sometimes requires 
cumbersome type annotations. While XML filters are allowed to 
operate by composition on the result of a recursive call (and, thus, 
simulate bottom-up tree transformations), the absence of explicit 
arguments in recursive calls makes programs understandable only 
to well-trained programmers. In contrast, the main focus of the 
current work was to make programs immediately intelligible to any 
functional programmer and make filters effective for the typing 
of sequence transformations: sequence iteration, element filtering, 
one-level flattening. The last two are especially difficult to write 
with XML filters (and require type annotations). Also, the 
integration of filters with record types (absent in [9] and just sketched in 
[23]) is novel and much needed to encode JSON transformations. 

10. Conclusion 

Our work addresses two very practical problems, namely the 
typing of NoSQL languages and a comprehensive definition of their 
semantics. These languages add to list comprehension and SQL 
operators the ability to work on heterogeneous data sets and are 
based on JSON (instead of tuples). Typing precisely each of these 
features using the best techniques of the literature would probably 
yield quite a complex type-system (mixing row polymorphism for 
records, parametric polymorphism, some form of dependent 
typing, ...) and we are skeptical that this could be achieved without 
using any explicit type annotation. Therefore we explored the 
formalization of these languages from scratch, by defining a calculus and 
a type system. The thesis we defended is that all operations typical 
of current NoSQL languages, as long as they operate structurally 
(ie, without resorting to term equality or relations), amount to a 
combination of more basic bricks: our filters. On the structural side, 
the claim is that combining recursive records and pairs by unions, 
intersections, and negations suffices to capture all possible 
structuring of data, covering a palette ranging from comprehensions, to 
heterogeneous lists mixing typed and untyped data, through regular 
expression types and XML schemas. Therefore, our calculus not 
only provides a simple way to give a formal semantics to, 
reciprocally compare, and combine operators of different NoSQL 
languages, but also offers a means to equip these languages, in their 
current definition (ie, without any type definition or annotation), 
with precise type inference. 

As such we accounted for both components that, according to 
Landin, constitute the design of a language: operators and data 
structures. But while Landin considers the design of terms and 
types as independent activities we, on the contrary, advocate an 
approach in which the design of the former is driven by the form of 
the latter. Although other approaches are possible, we tried to convey 
the idea that this approach is nevertheless the only one that yields 
a type system whose precision, which we demonstrated throughout 
this work, is comparable only to the precision obtained with hard-coded 
(as opposed to user-defined) operators. As such, our type inference 
matches and surpasses in precision systems using parametric 
polymorphism and row variables. The price to pay is that 
transformations are not first class: we do not type filters but just their 
applications. However, this seems an advantageous deal in the world 
of NoSQL languages where "selects" are never passed around (at 
least, not explicitly), but early error detection is critical, especially 
in view of the cost of code deployment. 

The result is filters, a set of untyped terms that can be easily 
included in a host language to complement, in a typeful framework, 
existing operators with user-defined ones. The requirements to include 
filters into a host language are so minimal that every modern 
typed programming language satisfies them. The interest resides 
not in the fact that we can add filter applications to any language, 
but rather in the fact that filters can be used to define a smooth 
integration of calls to domain specific languages (eg, SQL, XPath, 
Pig, Regex) into general purpose ones (eg, Java, C#, Python, OCaml) 
so that both can share the same set of values and the same typing 
discipline. Likewise, even though filters provide an early prototyping 
platform for queries, they cannot currently be used as a final 
compilation stage for NoSQL languages: their operations rely on a 
Lisp-like encoding of sequences and this makes the correspondence 
with optimized bulk operations on lists awkward. Whether we can 
derive an efficient compilation from filters to map-reduce (recovering 
the bulk semantics of the high-level language) is a challenging question. 

Future plans include practical experimentation with our technique: 
we intend to benchmark our type analysis against existing collections 
of Jaql programs, gauge the amount of code that is ill typed, 
and verify on this corpus how frequently programmers adopted 
defensive programming to cope with potential type errors. 
A. Termination analysis algorithm 

In order to deduce the result type of the application of a filter to 
an expression, the type inference algorithm abstractly executes the 
filter on the type of the expression. As explained in Section 4, the 
algorithm essentially analyzes what may happen to the original 
input type (and to its subtrees) after a recursive call and checks 
that, in all possible cases, every subsequent recursive call will 
only be applied to subtrees of the original input type. In order to 
track the subtrees of the original input type, we use an infinite set 
$\mathit{Ids} = \{\iota_1, \iota_2, \ldots\}$ of variable identifiers 
(ranged over by $\iota$, possibly indexed). These identifiers are used 
to identify the subtrees of the original input type that are bound to 
some variable after a recursive call. Consider for instance the 
recursive filter: 

$$\mu X.\ (x_1,(x_2,x_3)) \Rightarrow X(x_1,x_3) \;\mid\; \_ \Rightarrow \texttt{`nil}.$$


The algorithm records that at the first call of this filter each variable 
$x_i$ is bound to a subtree $\iota_i$ of the input type. The recursive call 
$X(x_1,x_3)$ is thus applied to the "abstract argument" $(\iota_1,\iota_3)$. 
If we perform the substitutions for this call, then we see that $x_1$ will 
be bound to $\iota_1$ and that $x_2$ and $x_3$ will be bound to two subtrees 
of the $\iota_3$ subtree. We do not care about $x_2$ since it is not used in 
any recursive call. What is important is that both $x_1$ and $x_3$ are 
bound to subtrees of the original input type (respectively, to $\iota_1$ and 
to a subtree of $\iota_3$) and therefore, for this filter, the type inference 
algorithm will terminate for every possible input type. 

The previous example introduces the two important concepts 
that are still missing for the definition of our analysis: 

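1. Recursive calls are applied to symbolic arguments, such as 
$(\iota_1,\iota_3)$, that are obtained from arguments by replacing variables 
by variable identifiers. Symbolic arguments are ranged over by 
$A$ and are formally defined as follows: 

$$\begin{array}{llcll}
\textbf{Symb Args} & A & ::= & \iota & \text{(variable identifiers)}\\
 & & \mid & c & \text{(constants)}\\
 & & \mid & (A,A) & \text{(pairs)}\\
 & & \mid & \{\ell{:}A, \ldots, \ell{:}A\} & \text{(records)}\\
 & & \mid & \bot & \text{(indeterminate)}
\end{array}$$
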

2. We said that we "care" for some variables and disregard others. 
When we analyze a recursive filter $\mu X.f$ the variables we care 
about are those that occur in arguments of recursive calls of $X$. 
Given a filter $f$ and a filter recursion variable $X$ the set of these 
variables is formally defined as follows 

$$\mathit{Mark}_X(f) = \{x \mid X a \sqsubseteq f \text{ and } x \in \mathit{vars}(a)\}$$

where $\sqsubseteq$ denotes the subtree containment relation and 
$\mathit{vars}(e)$ is the set of expression variables that occur free in $e$. 
With an abuse of notation we will use $\mathit{vars}(p)$ to denote the capture 
variables occurring in $p$ (thus, $\mathit{vars}(f\,e) = \mathit{vars}(f) \cup \mathit{vars}(e)$ 
and $\mathit{vars}(p \Rightarrow f) = \mathit{vars}(f) \setminus \mathit{vars}(p)$, 
the rest of the definition being standard). 
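
The set of marked variables can be illustrated with a small sketch. The 
Python encoding below is hypothetical (the class names `Var`, `Pair`, 
`Call`, `PatFilter`, `Union` and `Expr` are not part of the calculus) and 
covers only the filter constructs needed for the example; in particular 
it ignores shadowing of $X$ by a nested recursion. 

```python
from dataclasses import dataclass

# Hypothetical mini-AST for arguments of recursive calls.
@dataclass(frozen=True)
class Var:            # expression variable x
    name: str

@dataclass(frozen=True)
class Const:          # constant c
    value: object

@dataclass(frozen=True)
class Pair:           # pair (a1, a2)
    left: object
    right: object

# Hypothetical mini-AST for filters (only what the example needs).
@dataclass(frozen=True)
class Call:           # recursive call X a
    rec_var: str
    arg: object

@dataclass(frozen=True)
class PatFilter:      # p => f (the pattern is kept opaque here)
    pattern: object
    body: object

@dataclass(frozen=True)
class Union:          # f1 | f2
    left: object
    right: object

@dataclass(frozen=True)
class Expr:           # an expression used as a filter
    expr: object

def vars_of(a):
    """Expression variables occurring in an argument."""
    if isinstance(a, Var):
        return {a.name}
    if isinstance(a, Pair):
        return vars_of(a.left) | vars_of(a.right)
    return set()

def mark(rec_var, f):
    """Variables occurring in arguments of recursive calls of rec_var in f."""
    if isinstance(f, Call):
        return vars_of(f.arg) if f.rec_var == rec_var else set()
    if isinstance(f, PatFilter):
        return mark(rec_var, f.body)
    if isinstance(f, Union):
        return mark(rec_var, f.left) | mark(rec_var, f.right)
    return set()

# Body of: mu X. (x1,(x2,x3)) => X(x1,x3) | _ => `nil
body = Union(
    PatFilter(Pair(Var("x1"), Pair(Var("x2"), Var("x3"))),
              Call("X", Pair(Var("x1"), Var("x3")))),
    PatFilter("_", Expr(Const("nil"))),
)
marked = mark("X", body)
```

On the recursive filter of the running example this sketch marks $x_1$ and 
$x_3$ and leaves $x_2$ unmarked, which matches the informal discussion above. 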

As an aside notice that the fact that for a given filter $f$ the type 
inference algorithm terminates on all possible input types does 
not imply that the execution of $f$ terminates on all possible input 
values. For instance, our analysis correctly detects that for the filter 
$\mu X.\ (x_1,x_2) \Rightarrow X(x_1,x_2) \mid \_ \Rightarrow \_$ type 
inference terminates on all possible input types (by returning the very 
same input type) although the application of this same filter never 
terminates on arguments of the form $(v_1,v_2)$. 

We can now formally define the two phases of our analysis 
algorithm. 

First phase. The first phase is implemented by the function 
$\mathit{Trees}_{\_}(\_, \_)$. For every filter recursion variable $X$, this 
function explores a filter and does the following two things: 

1. It builds a substitution $\sigma : \mathit{Vars} \to \mathit{Ids}$ from 
expression variables to variable identifiers, thus associating each capture 
variable occurring in a pattern of the filter to a fresh identifier 
for the abstract subtree of the input type it will be bound to. 

2. It uses the substitution $\sigma$ to compute the set of symbolic 
arguments of recursive calls of $X$. 

In other words, if $n$ recursive calls of $X$ occur in the filter $f$, 
then $\mathit{Trees}_X(\sigma, f)$ returns the set $\{A_1, \ldots, A_n\}$ of the 
$n$ symbolic arguments of these calls, obtained under the hypothesis that 
the free variables of $f$ are associated to subtrees as specified by 
$\sigma$. The formal definition is as follows: 

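$$\begin{array}{rcll}
\mathit{Trees}_X(\sigma, e) &=& \varnothing\\
\mathit{Trees}_X(\sigma, p \Rightarrow f) &=& \mathit{Trees}_X(\sigma \cup \bigcup_{x_i \in \mathit{vars}(p)} \{x_i \mapsto \iota_i\},\ f) & (\iota_i \text{ fresh})\\
\mathit{Trees}_X(\sigma, (f_1,f_2)) &=& \mathit{Trees}_X(\sigma, f_1) \cup \mathit{Trees}_X(\sigma, f_2)\\
\mathit{Trees}_X(\sigma, f_1|f_2) &=& \mathit{Trees}_X(\sigma, f_1) \cup \mathit{Trees}_X(\sigma, f_2)\\
\mathit{Trees}_X(\sigma, \mu X.f) &=& \varnothing\\
\mathit{Trees}_X(\sigma, \mu Y.f) &=& \mathit{Trees}_X(\sigma, f) & (X \neq Y)\\
\mathit{Trees}_X(\sigma, X a) &=& \{a\sigma\}\\
\mathit{Trees}_X(\sigma, f_1;f_2) &=& \mathit{Trees}_X(\sigma, f_2)\\
\mathit{Trees}_X(\sigma, o\,f) &=& \mathit{Trees}_X(\sigma, f) & (o = \mathit{groupby}, \mathit{orderby})\\
\mathit{Trees}_X(\sigma, \{\ell_i{:}f_i, ..\}_{i \in I}) &=& \bigcup_{i \in I} \mathit{Trees}_X(\sigma, f_i)
\end{array}$$

where $a\sigma$ denotes the application of the substitution $\sigma$ to $a$. 

The definition above is mostly straightforward. The two important 
cases are those for the pattern filter where the substitution $\sigma$ 
is updated by associating every capture variable of the pattern with 
a fresh variable identifier, and the one for the recursive call where 
the symbolic argument is computed by applying $\sigma$ to the actual 
argument. 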
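
As an illustration, the first phase can be sketched in Python on a 
hypothetical tuple encoding of filters (tags `expr`, `pat`, `prod`, 
`union`, `mu`, `call`, `comp`). The encoding, the helper names, and the 
choice of returning the empty set for calls to a recursion variable other 
than $X$ are assumptions of this sketch; records, groupby and orderby are 
omitted for brevity. 

```python
import itertools

_fresh = itertools.count(1)

def fresh_id():
    """Return a fresh variable identifier, encoded as ('id', n)."""
    return ("id", next(_fresh))

def pattern_vars(p):
    """Capture variables of a pattern: ('var', x), ('pair', p1, p2), ('any',)."""
    if p[0] == "var":
        return [p[1]]
    if p[0] == "pair":
        return pattern_vars(p[1]) + pattern_vars(p[2])
    return []

def subst(a, sigma):
    """Apply sigma to an argument: ('var', x), ('const', c), ('pair', a1, a2)."""
    if a[0] == "var":
        return sigma[a[1]]
    if a[0] == "pair":
        return ("pair", subst(a[1], sigma), subst(a[2], sigma))
    return a

def trees(X, sigma, f):
    """Symbolic arguments of the recursive calls of X occurring in f."""
    tag = f[0]
    if tag == "expr":                      # an expression contains no call
        return set()
    if tag == "pat":                       # p => f: extend sigma with fresh ids
        _, p, body = f
        sigma2 = dict(sigma)
        for x in pattern_vars(p):
            sigma2[x] = fresh_id()
        return trees(X, sigma2, body)
    if tag in ("prod", "union"):           # (f1,f2) and f1|f2
        return trees(X, sigma, f[1]) | trees(X, sigma, f[2])
    if tag == "mu":                        # mu Y. f (X is rebound when Y == X)
        _, Y, body = f
        return set() if Y == X else trees(X, sigma, body)
    if tag == "call":                      # Y a
        _, Y, a = f
        return {subst(a, sigma)} if Y == X else set()
    if tag == "comp":                      # f1;f2: only f2 is explored
        return trees(X, sigma, f[2])
    raise ValueError(f"unsupported filter: {tag}")

# Body of: mu X. (x1,(x2,x3)) => X(x1,x3) | _ => `nil
body = ("union",
        ("pat", ("pair", ("var", "x1"), ("pair", ("var", "x2"), ("var", "x3"))),
                ("call", "X", ("pair", ("var", "x1"), ("var", "x3")))),
        ("pat", ("any",), ("expr", "nil")))
result = trees("X", {}, body)
```

With fresh identifiers numbered in order of appearance, the only symbolic 
argument collected for the running example is the pair of the identifiers 
bound to $x_1$ and $x_3$. 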

Second phase. The second phase is implemented by the function 
Check(). The intuition is that Check($f$) must "compute" the application 
of $f$ to all the symbolic arguments collected in the first phase 
and then check whether the variables occurring in the arguments 
of recursive calls (ie, the "marked" variables) are actually bound 
to subtrees of the original type (ie, they are bound either to variable 
identifiers or to subparts of variable identifiers). If we did not 
have filter composition, then this would be a relatively easy task: 
it would amount to computing substitutions (by matching symbolic 
arguments against patterns), applying them, and finally verifying that 
the marked variables satisfy the sought property. Unfortunately, in the 
case of a composition filter $f_1;f_2$ the analysis is more complicated 
than that. Imagine that we want to check the property for the application 
of $f_1;f_2$ to a symbolic argument $A$. Then to check the 
property for the recursive calls occurring in $f_2$ we must compute 
(or at least, approximate) the set of the symbolic arguments that 
will be produced by the application of $f_1$ to $A$ and that will be thus 
fed into $f_2$ to compute the composition. Therefore Check($f$) will 
be a function that, rather than returning just true or false, returns 
a set of symbolic arguments that are the result of the execution 
of $f$, or fails if any recursive call does not satisfy the property 
for marked variables. 

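More precisely, the function Check() will have the form 

$$\mathit{Check}_V(\sigma, f, \{A_1, \ldots, A_n\})$$

where $V \subseteq \mathit{Vars}$ stores the set of marked variables, $f$ is a 
filter, $\sigma$ is a substitution for (at least) the free variables of $f$ 
into $\mathit{Ids}$, and the $A_i$ are symbolic arguments. It will either fail 
if the marked variables do not satisfy the property of being bound to 
(subcomponents of) variable identifiers or return an over-approximation of 
the result of applying $f$ to all the $A_i$ under the hypothesis $\sigma$. 
This approximation is either a set of new symbolic arguments, or $\bot$. 
The latter simply indicates that Check() is not able to compute the result 
of the application, typically because it is the result of some expression 
belonging to the host language (eg, the application of a function) or 
because it is the result of a recursive filter or of a recursive call of 
some filter. 

$$\mathit{Check}_V(\sigma, f, \{A_1, \ldots, A_n\}) = \bigcup_{i=1}^{n} \mathit{Check}_V(\sigma, f, A_i)$$

If $\{\!|A|\!\} \,\&\, \lbag f \rbag = \varnothing$ then $\mathit{Check}_V(\sigma, f, A) = \varnothing$, otherwise: 

$$\begin{array}{rcl}
\mathit{Check}_V(\sigma, e, A) &=&
  \begin{cases}
    \{a\sigma\} & \text{if } e \equiv a\\
    \bot & \text{otherwise}
  \end{cases}\\[1ex]
\mathit{Check}_V(\sigma, f_1;f_2, A) &=& \mathit{Check}_V(\sigma, f_2, \mathit{Check}_V(\sigma, f_1, A))\\[1ex]
\mathit{Check}_V(\sigma, X a, A) &=& \bot\\[1ex]
\mathit{Check}_V(\sigma, \mu X.f, A) &=&
  \begin{cases}
    \mathit{fail} & \text{if } \mathit{Check}_{\mathit{Mark}_X(f)}(\sigma, f, A) = \mathit{fail}\\
    \bot & \text{otherwise}
  \end{cases}\\[1ex]
\mathit{Check}_V(\sigma, f_1|f_2, A) &=&
  \begin{cases}
    \mathit{Check}_V(\sigma, f_2, A) & \text{if } \lbag f_1 \rbag \,\&\, \{\!|A|\!\} = \varnothing\\
    \mathit{Check}_V(\sigma, f_1, A) & \text{if } \{\!|A|\!\} \setminus \lbag f_1 \rbag = \varnothing\\
    \mathit{Check}_V(\sigma, f_1, A) \cup \mathit{Check}_V(\sigma, f_2, A) & \text{otherwise}
  \end{cases}\\[1ex]
\mathit{Check}_V(\sigma, (f_1,f_2), A) &=&
  \begin{cases}
    \mathit{Check}_V(\sigma, f_1, A_1) \times \mathit{Check}_V(\sigma, f_2, A_2) & \text{if } A \equiv (A_1,A_2)\\
    \mathit{Check}_V(\sigma, f_1, \iota_1) \times \mathit{Check}_V(\sigma, f_2, \iota_2) & \text{if } A \equiv \iota \text{ (where } \iota_1 \text{ and } \iota_2 \text{ are fresh)}\\
    \mathit{fail} & \text{if } A \equiv \bot
  \end{cases}\\[1ex]
\mathit{Check}_V(\sigma, \{\ell_i{:}f_i, ..\}_{1 \le i \le n}, A) &=&
  \begin{cases}
    \displaystyle\bigcup_{B_i \in \mathit{Check}_V(\sigma, f_i, A_i)} \{\ell_1{:}B_1, \ldots, \ell_n{:}B_n, \ldots, \ell_{n+k}{:}A_{n+k}, ..\} & \text{if } A \equiv \{\ell_i{:}A_i, ..\}_{1 \le i \le n+k}\\
    \displaystyle\bigcup_{B_i \in \mathit{Check}_V(\sigma, f_i, \iota_i)} \{\ell_1{:}B_1, \ldots, \ell_n{:}B_n, \ldots, \ell_{n+k}{:}A_{n+k}, ..\} & \text{if } A \equiv \iota \text{ (where all } \iota_i \text{ are fresh)}\\
    \mathit{fail} & \text{if } A \equiv \bot
  \end{cases}\\[1ex]
\mathit{Check}_V(\sigma, p \Rightarrow f, A) &=&
  \begin{cases}
    \mathit{fail} & \text{if } A /_V\, p = \mathit{fail}\\
    \mathit{Check}_V(\sigma \cup (A /_V\, p), f, A) & \text{otherwise}
  \end{cases}
\end{array}$$

Figure 3. Definition of the second phase of the analysis 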

The full definition of $\mathit{Check}_{\_}(\_, \_, \_)$ is given in Figure 3. 
Let us comment on the different cases in detail. 

$$\mathit{Check}_V(\sigma, f, \{A_1, \ldots, A_n\}) = \bigcup_{i=1}^{n} \mathit{Check}_V(\sigma, f, A_i)$$

simply states that to compute the filter $f$ on a set of symbolic arguments 
we have to compute $f$ on each argument and return the union 
of the results. Of course, if any $\mathit{Check}_V(\sigma, f, A_i)$ fails, so 
does $\mathit{Check}_V(\sigma, f, \{A_1, \ldots, A_n\})$. The next case is 
straightforward: if we know that an argument of a given form makes the 
filter $f$ fail, then we do not perform any check (the recursive call of $f$ 
will never be called) and no result will be returned. For instance, if we 
apply the filter $\{\ell{:}f, ..\}$ to the symbolic argument 
$(\iota_1,\iota_2)$, then this will always fail so it is useless to continue 
checking this case. Formally, what we do is to consider $\{\!|A|\!\}$, the 
set of all values that have the form of $A$ (this is quite simply obtained 
from $A$ by substituting the type any for every occurrence of a variable 
identifier or of $\bot$) and check whether in it there is any value 
accepted by $f$, namely, whether the intersection 
$\{\!|A|\!\} \,\&\, \lbag f \rbag$ is empty. If it is so, we directly return 
the empty set of symbolic arguments, otherwise we have to perform a 
fine grained analysis according to the form of the filter. 

If the filter is an expression, then there are two possible cases: 
either the expression has the form of an argument ($\equiv$ denotes 
syntactic equivalence), in which case we return it after having 
applied the substitution $\sigma$ to it; or it is some other expression, in 
which case the function says that it is not able to compute a result 
and returns $\bot$. 

When the filter is a composition of two filters, $f_1;f_2$, we first 
call $\mathit{Check}_V(\sigma, f_1, A)$ to analyze $f_1$ and compute the result 
of applying $f_1$ to $A$, and then we feed this result to the analysis of 
$f_2$: 

$$\mathit{Check}_V(\sigma, f_1;f_2, A) = \mathit{Check}_V(\sigma, f_2, \mathit{Check}_V(\sigma, f_1, A))$$

When the filter is recursive or a recursive call, then we are not able 
to compute the symbolic result. However, in the case of a recursive 
filter we must check whether the type inference terminates on it. So 
we mark the variables of its recursive calls, check whether its definition 
passes our static analysis, and return $\bot$ only if this analysis 
(ie, $\mathit{Check}_{\mathit{Mark}_X(f)}(\sigma, f, A)$) did not fail. Notice 
that $\bot$ is quite different from failure since it allows us to compute 
the approximation as long as the filter after the composition does not 
bind (parts of) the result that are $\bot$ to variables that occur in 
arguments of recursive calls. A key example is the case of the filter 
Filter[F] defined in Section 5.1. If the parameter filter F is recursive, 
then without $\bot$ Filter[F] would be rejected by the static analysis. 
Precisely, this is due to the composition 
$Fx;(\texttt{`false} \Rightarrow Xl \mid \cdots)$. 
When F is recursive the analysis supposes that $Fx$ produces $\bot$, 
which is then passed over to the pattern. The recursive call $Xl$ is 
executed in the environment $\bot/\texttt{`false}$, which is empty. Therefore 
the result of the $Fx$ call, whatever it is, cannot affect the termination 
of the subsequent recursive calls. Even though the composition 
uses the result of $Fx$, the form of this result cannot be such as to 
make typechecking of Filter[F] diverge. 

$$\begin{array}{rcl}
A/x &=&
  \begin{cases}
    \{x \mapsto A\} & \text{if } A \equiv v \text{ or } A \equiv \iota \text{ or } x \notin V\\
    \mathit{fail} & \text{otherwise}
  \end{cases}\\[1ex]
A/(p_1,p_2) &=&
  \begin{cases}
    (A_1/p_1) \cup (A_2/p_2) & \text{if } A \equiv (A_1,A_2)\\
    (\iota_1/p_1) \cup (\iota_2/p_2) & \text{if } A \equiv \iota \text{ (where } \iota_1, \iota_2 \text{ are fresh)}\\
    \mathit{fail} & \text{if } A \equiv \bot\\
    \Omega & \text{otherwise}
  \end{cases}\\[1ex]
A/p_1 \& p_2 &=& A/p_1 \cup A/p_2\\[1ex]
A/p_1 | p_2 &=&
  \begin{cases}
    A/p_1 & \text{if } A/p_1 \neq \Omega\\
    A/p_2 & \text{otherwise}
  \end{cases}\\[1ex]
A/\{\ell_1{:}p_1, \ldots, \ell_n{:}p_n\} &=&
  \begin{cases}
    \bigcup_{1 \le i \le n} A_i/p_i & \text{if } A \equiv \{\ell_1{:}A_1, \ldots, \ell_n{:}A_n\}\\
    \bigcup_{1 \le i \le n} \iota_i/p_i & \text{if } A \equiv \iota \text{ (where } \iota_i \text{ are fresh)}\\
    \mathit{fail} & \text{if } A \equiv \bot\\
    \Omega & \text{otherwise}
  \end{cases}\\[1ex]
A/\{\ell_1{:}p_1, \ldots, \ell_n{:}p_n, ..\} &=&
  \begin{cases}
    \bigcup_{1 \le i \le n} A_i/p_i & \text{if } A \equiv \{\ell_1{:}A_1, \ldots, \ell_n{:}A_n, \ldots, \ell_{n+k}{:}A_{n+k}\}\\
    \bigcup_{1 \le i \le n} \iota_i/p_i & \text{if } A \equiv \iota \text{ (where } \iota_i \text{ are fresh)}\\
    \mathit{fail} & \text{if } A \equiv \bot\\
    \Omega & \text{otherwise}
  \end{cases}\\[1ex]
A/t &=& \varnothing
\end{array}$$

Figure 4. Pattern matching with respect to a set V of marked variables 

For the union filter $f_1|f_2$ there are three possible cases: $f_1$ will 
always fail on $A$, so we return the result of $f_2$; or $f_2$ will never 
be executed on $A$ since $f_1$ will never fail on $A$, so we return just 
the result of $f_1$; or we cannot tell which one of $f_1$ or $f_2$ will be 
executed, and so we return the union of the two results. 

If the filter is a product filter $(f_1,f_2)$, then its accepted type 
$\lbag (f_1,f_2) \rbag$ is a subtype of (any,any). Since we are in the case 
where $\{\!|A|\!\} \,\&\, \lbag (f_1,f_2) \rbag$ is not empty, then $A$ can 
have just two forms: either it is a pair or it is a variable identifier 
$\iota$. In the former case we check the filters component-wise and return 
as result the set-theoretic product of the results. In the latter case, 
recall that $\iota$ represents a subtree of the original input type. So if 
the application does not fail it means that each subfilter will be applied 
to a subtree of $\iota$. We introduce two fresh variable identifiers to 
denote these subtrees of $\iota$ (which, of course, are subtrees of the 
original input type, too) and, as in the previous case, apply the check 
component-wise and return the set-theoretic product of the results. The 
case for the record filter is similar to the case of the product filter. 

Finally, the case of the pattern filter is where the algorithm checks 
that the marked variables are indeed bound to subtrees of the initial 
input type. What Check() does is to match the symbolic argument 
against the pattern and update the substitution $\sigma$ so as to check the 
subfilter $f$ in an environment with the corresponding assignments 
for the capture variables in $p$. Notice however that the pattern 
matching $A /_V\, p$ receives as extra argument the set $V$ of marked 
variables, so that while computing the substitutions for the capture 
variables in $p$ it also checks that all capture variables that are also 
marked are actually bound to a subtree of the initial input type. 
$A /_V\, p$ is defined in Figure 4 (to enhance readability we omitted the 
$V$ index since it matters only in the first case and it does not change 
along the definition). The check for marked variables is performed 
in the first case of the definition: when a symbolic argument $A$ is 
matched against a variable $x$, if the variable is not marked ($x \notin V$), 
then the substitution $\{x \mapsto A\}$ is returned; if it is marked, then the 
matching (and thus the analysis) does not fail only if the symbolic 
argument is either a value (so it does not depend on the input type) 
or it is a variable identifier (so it will be exactly a subtree of the 
original input type). The other cases of the definition of $A /_V\, p$ 
are standard. Just notice that in the product and record patterns 
when $A$ is a variable identifier $\iota$, the algorithm creates fresh new 
variable identifiers to denote the new subtrees in which $\iota$ will be 
decomposed by the pattern. Finally, the reader should not confuse 
$\Omega$ and fail. The former indicates that the argument does not match 
the pattern, while the latter indicates that a marked variable is not 
bound to a value or to exactly a subtree of the initial input type 
(notice that the case $\Omega$ cannot happen in the definition of Check() 
since $A /_V\, p$ is called only when $\{\!|A|\!\}$ is contained in 
$\lbag p \rbag$). 
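
The check on marked variables performed by pattern matching can be 
sketched as follows. This is a minimal Python illustration restricted to 
variable, pair, constant and wildcard patterns; the tuple encoding, the 
names `match`, `Fail` and `BOTTOM`, and the decision to let constant 
patterns return the empty substitution (relying on the precondition that 
the argument may match the pattern) are assumptions of the sketch, not 
part of the formal definition. 

```python
import itertools

class Fail(Exception):
    """A marked variable is bound to neither a value nor an identifier."""

BOTTOM = ("bot",)                 # the indeterminate symbolic argument
_fresh = itertools.count(100)     # start high to stay clear of example ids

def fresh_id():
    return ("id", next(_fresh))

def is_value(A):
    """Constants and pairs of values do not depend on the input type."""
    return A[0] == "const" or (
        A[0] == "pair" and is_value(A[1]) and is_value(A[2]))

def match(A, p, marked):
    """Match symbolic argument A against pattern p; return a substitution.

    Symbolic arguments: ('id', n), ('const', c), ('pair', A1, A2), BOTTOM.
    Patterns: ('var', x), ('pair', p1, p2), ('const', c), ('any',).
    """
    if p[0] in ("any", "const"):
        return {}
    if p[0] == "var":
        x = p[1]
        if x not in marked or A[0] == "id" or is_value(A):
            return {x: A}
        raise Fail(x)
    if p[0] == "pair":
        if A[0] == "pair":
            A1, A2 = A[1], A[2]
        elif A[0] == "id":
            # The identifier is decomposed into two fresh subtrees.
            A1, A2 = fresh_id(), fresh_id()
        elif A == BOTTOM:
            raise Fail("indeterminate argument decomposed by a pattern")
        else:
            raise ValueError("Omega: the argument cannot match the pattern")
        return {**match(A1, p[1], marked), **match(A2, p[2], marked)}
    raise ValueError("unsupported pattern")

# (iota1, iota3) against (x1,(x2,x3)) with x1 and x3 marked: accepted.
p1 = ("pair", ("var", "x1"), ("pair", ("var", "x2"), ("var", "x3")))
accepted = match(("pair", ("id", 1), ("id", 3)), p1, {"x1", "x3"})

# (`nil,(iota7,iota7)) against (`nil, x) with x marked: rejected, since x
# would be bound to a pair that is not a subtree of the input type.
p2 = ("pair", ("const", "nil"), ("var", "x"))
arg2 = ("pair", ("const", "nil"), ("pair", ("id", 7), ("id", 7)))
try:
    match(arg2, p2, {"x", "y"})
    rejected = False
except Fail:
    rejected = True
```

In the first call the identifier standing for the second component is 
split into two fresh identifiers, so every marked variable ends up bound 
to an identifier; in the second call the marked variable would be bound to 
a freshly built pair, which is exactly the situation the analysis rejects. 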

The two phases together. Finally we put the two phases together. 
Given a recursive filter $\mu X.f$ we run the analysis by marking 
the variables in the recursive calls of $X$ in $f$, running the 
first phase and then feeding the result to the second phase: 
$\mathit{Check}_{\mathit{Mark}_X(f)}(\varnothing, f, \mathit{Trees}_X(\varnothing, f))$. 
If this function does not fail then we say that $\mu X.f$ passed the check: 

$$\mathit{Check}(\mu X.f) \;\stackrel{\text{def}}{\iff}\; \mathit{Check}_{\mathit{Mark}_X(f)}(\varnothing, f, \mathit{Trees}_X(\varnothing, f)) \neq \mathit{fail}$$

For filters that are not recursive at toplevel it suffices to add a 
dummy recursion: $\mathit{Check}(f) = \mathit{Check}(\mu X.f)$ with $X$ fresh. 

Theorem 16 (Termination). If Check{f) then the type inference 
algorithm terminates for f on every input type t. 

In order to prove the theorem above we need some auxiliary 
definitions: 

Definition 17 (Plinth (Hi). A plinth $\beth \subseteq \mathit{Types}$ is a set 
of types with the following properties: 

• $\beth$ is finite 

• $\beth$ contains any, empty and is closed under Boolean connectives 
(&, |, ¬) 

• for all types $t = (t_1,t_2)$ in $\beth$, $t_1 \in \beth$ and $t_2 \in \beth$ 

• for all types $t = \{\ell_1{:}t_1, \ldots, \ell_n{:}t_n, (..)\}$ in $\beth$, 
$t_i \in \beth$ for all $i \in [1..n]$. 

We define the plinth of $t$, noted $\beth(t)$, as the smallest plinth 
containing $t$. 

Intuitively, the plinth of $t$ is the set of types that can be obtained 
by all possible Boolean combinations of the subtrees of $t$. Notice 
that $\beth(t)$ is always defined since our types are regular: they have 
finitely many distinct subtrees, which (modulo type equivalence) 
can thus be combined in finitely many different ways. 
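
The finiteness argument can be illustrated on a toy model in which a type 
is interpreted as a subset of a finite universe of values. This 
set-theoretic model and the function name `boolean_closure` are 
assumptions made only for the illustration; the sketch closes a family of 
such sets under union, intersection and complement and shows that the 
closure is reached after finitely many steps. 

```python
def boolean_closure(universe, generators):
    """Smallest family containing the generators, the empty set and the
    whole universe, closed under union, intersection and complement."""
    universe = frozenset(universe)
    family = {frozenset(), universe} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        # Iterate over snapshots so that the family can grow in the loop.
        for s in list(family):
            for t in list(family):
                for new in (s | t, s & t, universe - s):
                    if new not in family:
                        family.add(new)
                        changed = True
    return family

# Two generating "types" over a universe of four values.
closure = boolean_closure(range(4), [{0, 1}, {1, 2}])
```

Here the two generators separate all four values, so the closure is the 
full powerset of the universe; in general the closure of $n$ generators 
has at most $2^{2^n}$ elements, which is why it is always finite. 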

Definition 18 (Extended plinth and support). Let $t$ be a type and 
$f$ a filter. The extended plinth of $t$ and $f$, noted $\beth(f,t)$, is 
defined as $\beth(t \vee \bigvee_{v \sqsubseteq f} v)$ (where $v$ ranges over 
values). 

The support of $t$ and $f$, noted as $\mathit{Support}(f,t)$, is defined as 

$$\mathit{Support}(f,t) \;\stackrel{\text{def}}{=}\; \beth(f,t) \;\cup \bigcup_{A \in \mathit{Trees}_{\varnothing}(\varnothing, f)} \{A\sigma \mid \sigma : \mathit{Ids}(A) \to \beth(f,t)\}$$

The extended plinth of $t$ and $f$ is the set of types that can be 
obtained by all possible Boolean combinations of the subtrees of $t$ 
and of values that occur in the filter $f$. The intuition underlying the 
definition of support of $t$ and $f$ is that it includes all the possible 
types of arguments of recursive calls occurring in $f$, when $f$ is 
applied to an argument of type $t$. 

Lastly, let us prove the following technical lemma: 

Lemma 19. Let $f$ be a filter such that Check($f$) holds. Let $t$ be a 
type. For every derivation $D$ (finite or infinite) of 

$$\Gamma;\Delta;M \vdash_{\mathit{fil}} f(t) : s$$

for every occurrence of the rule [Fil-Call-New] 

$$\frac{\Gamma';\Delta';M',((X,t'') \mapsto T) \vdash_{\mathit{fil}} \Delta'(X)(t'') : t'}{\Gamma';\Delta';M' \vdash_{\mathit{fil}} (X a)(s) : \mu T.t'} \quad \begin{array}{l} t'' = \mathit{type}(\Gamma', a)\\ (X,t'') \notin \mathit{dom}(M')\\ T \text{ fresh} \end{array}$$

in $D$, for every $x \in \mathit{vars}(a)$, $\Gamma'(x) \in \beth(f,t)$ (or 
equivalently, $\mathit{type}(\Gamma', a) \in \mathit{Support}(f,t)$). 

Proof. By contradiction. Suppose that there exists an instance of 
the rule for which $x \in \mathit{vars}(a)$ and $\Gamma'(x) \notin \beth(f,t)$. 
This means that $\Gamma'(x)$ is neither a singleton type occurring in $f$ nor 
a type obtained from $t$ by applying left projection, right projection or 
label selection (★). But since Check($f$) holds, $x$ must be bound to 
either an identifier or a value $v$ (see Figure 4) during the computation of 
$\mathit{Check}_{\mathit{Mark}_X(f)}(\varnothing, f, \mathit{Trees}_X(\varnothing, f))$. 
Since identifiers in Check($f$) are only introduced for the input parameter 
or when performing a left projection, right projection or label selection 
of another identifier, this ensures that $x$ is never bound to the result 
of an expression whose type is not in $\beth(f,t)$, which contradicts (★). □ 



We are now able to prove Theorem 16. Let $t$ be a type and $f$ a 
filter, and define 

$$\mathfrak{D}(f,t) = \{(X,t') \mid X \sqsubseteq f,\ t' \in \mathit{Support}(f,t)\}$$

Notice that since $\mathit{Support}(f,t)$ is finite and there are finitely many 
different recursion variables occurring in a filter, then $\mathfrak{D}(f,t)$ 
is finite, too. Now let $f$ and $t$ be the filter and type mentioned in 
the statement of Theorem 16 and consider the (possibly infinite) 
derivation of $\Gamma;\Delta;M \vdash_{\mathit{fil}} f(t) : s$. Assign to every 
judgment $\Gamma';\Delta';M' \vdash_{\mathit{fil}} f'(t') : s'$ the following 
weight 

$$\mathit{Wgt}(\Gamma';\Delta';M' \vdash_{\mathit{fil}} f'(t') : s') = (\,|\mathfrak{D}(f,t) \setminus \mathit{dom}(M')|,\ \mathit{size}(f')\,)$$

where $|S|$ denotes the cardinality of the set $S$ (notice that in the 
definition $S$ is finite), $\mathit{dom}(M')$ is the domain of $M'$, that is, 
the set of pairs $(X,t)$ for which $M'$ is defined, and $\mathit{size}(f')$ is 
the depth of the syntax tree of $f'$. 

Notice that the set of all weights, lexicographically ordered, forms 
a well-founded order. It is then not difficult to prove that every 
application of a rule in the derivation of 
$\Gamma;\Delta;M \vdash_{\mathit{fil}} f(t) : s$ strictly decreases 
$\mathit{Wgt}$, and therefore that this derivation must be 
finite. This can be proved by case analysis on the applied rule, and 
we must distinguish only three cases: 

[Fil-Fix] Notice that in this case the first component of $\mathit{Wgt}$ 
either decreases or remains constant, and the second component 
strictly decreases. 

[Fil-Call-New] In this case Lemma 19 ensures that the first 
component of the $\mathit{Wgt}$ of the premise strictly decreases. Since 
this is the core of the proof let us expand the rule. Somewhere in 
the derivation of $\Gamma;\Delta;M \vdash_{\mathit{fil}} f(t) : s$ we have the 
following rule: 

$$\text{[Fil-Call-New]}\quad \frac{\Gamma';\Delta';M',((X,t'') \mapsto T) \vdash_{\mathit{fil}} \Delta'(X)(t'') : t'}{\Gamma';\Delta';M' \vdash_{\mathit{fil}} (X a)(s) : \mu T.t'} \quad \begin{array}{l} t'' = \mathit{type}(\Gamma', a)\\ (X,t'') \notin \mathit{dom}(M')\\ T \text{ fresh} \end{array}$$

and we want to prove that 
$(\mathfrak{D}(f,t) \setminus \mathit{dom}(M')) \supseteq (\mathfrak{D}(f,t) \setminus (\mathit{dom}(M') \cup \{(X,t'')\}))$ 
and the containment must be strict to 
ensure that the measure decreases. First of all notice that 
$(X,t'') \notin \mathit{dom}(M')$, since it is a side condition for the 
application of the rule. So in order to prove that containment is strict 
it suffices to prove that $(X,t'') \in \mathfrak{D}(f,t)$. But this is a 
consequence of Lemma 19 which ensures that $t'' \in \mathit{Support}(f,t)$, 
whence the result. 

[Fil-*] With all other rules the first component of $\mathit{Wgt}$ remains 
constant, and the second component strictly decreases. 

A.1 Improvements 

Although the analysis performed by our algorithm is already fine 
grained, it can be further refined in two simple ways. As explained 
in Section 4, the algorithm checks that after one step of reduction 
the capture variables occurring in recursive calls are bound to 
subtrees of the initial input type. So a possible improvement consists 
in trying to check this property for a higher number of steps, too. For 
instance consider the filter: 

$$\begin{array}{rl}
\mu X. & (\texttt{`nil}, x) \Rightarrow X(\{\ell{:}x\})\\
\mid & (y, \_) \Rightarrow X((\texttt{`nil},(y,y)))\\
\mid & \texttt{`nil}
\end{array}$$

This filter does not pass our analysis since if $\iota_y$ is the identifier 
bound to the capture variable $y$, then when unfolding the recursive 
call in the second branch the $x$ in the first recursive call will be 
bound to $(\iota_y,\iota_y)$. But if we had pursued our abstract execution 
one step further we would have seen that $(\iota_y,\iota_y)$ is not used in a 
recursive call and, thus, that type inference terminates. Therefore, 
a first improvement is to modify Check() so that it does not stop 
just after one step of reduction but tries to go down $n$ steps, with 
$n$ determined by some heuristics based on the sizes of the filter and of 
the input type. 

    type M = (K,V) | (V,V,K) 
    type V = { var : string } | { lambda2 : (string, string, M) } 
    type K = { var : string } | { lambda1 : (string, M) } 

    let filter Eval = 
      | ( { lambda2 : (x, k, m) }, v , h ) -> m ; Subst[x,v] ; Subst[k,h] ; Eval 
      | ( { lambda1 : (x, m) }, v ) -> m ; Subst[x,v] ; Eval 
      | x -> x 

    let filter Subst[c,F] = 
      | ( Subst[c,F] , Subst[c,F] , Subst[c,F] ) 
      | ( Subst[c,F] , Subst[c,F] ) 
      | { var : c } -> F 
      | { lambda2 : (x&¬c, k&¬c, m) } -> { lambda2 : (x, k, m;Subst[c,F]) } 
      | { lambda1 : (x&¬c, m) } -> { lambda1 : (x, m;Subst[c,F]) } 
      | x -> x 

Figure 5. Filter encoding of $\lambda_{cps}$ 

A second and, in our opinion, more promising improvement 
is to enhance the precision of the test 
$\{\!|A|\!\} \,\&\, \lbag f \rbag = \varnothing$ in the 
definition of Check, which verifies whether the filter $f$ fails on the 
given symbolic argument $A$. In the current definition the only 
information we collect about the type of symbolic arguments is 
their structure. But further type information, currently unexploited, 
is provided by patterns. For instance, in the following (admittedly 
contrived) filter 

$$\begin{array}{rl}
\mu X. & (\mathtt{int}, x) \Rightarrow x\\
\mid & y \& \mathtt{int} \Rightarrow X((y,y))\\
\mid & z \Rightarrow z
\end{array}$$

if in the first pass we associate $y$ with $\iota_y$, then we know that 
$(\iota_y,\iota_y)$ has type (int,int). If we record this information, then 
we know that in the second pass $(\iota_y,\iota_y)$ will always match the first 
pattern, and so it will never be the argument of a further recursive call. 
In other words, there is at most one nested recursive call. The solution 
is conceptually simple (but yields a cumbersome formalization, 
which is why we chose the current formulation) and amounts to 
modifying Trees() so that when it introduces fresh variables it records 
their type information with them. It then just suffices to modify the 
definition of $\{\!|A|\!\}$ so that it is obtained from $A$ by replacing every 
occurrence of a variable identifier by its type information (rather 
than by any) and the current definition of Check() will do the rest. 

B. Proof of Turing completeness (Theorem 7) 

In order to prove Turing completeness we show how to define by 
filters an evaluator for untyped (call-by-value) $\lambda$-calculus. If we 
allowed recursive calls to occur on the left-hand side of composition, 
then the encoding would be straightforward: just implement $\beta$ and 
context reductions. The goal however is to show that the restriction 
on compositions does not affect expressiveness. To that end 
we have to avoid context reductions, since they require recursive 
calls before composition. To do so, we first translate $\lambda$-terms via 
Plotkin's call-by-value CPS translation and apply Steele and Rabbit's 
administrative reductions to them obtaining terms in $\lambda_{cps}$. The 
latter is isomorphic to cbv $\lambda$-calculus (see for instance jSOll ) and 
defined as follows. 

$$\begin{array}{rcl}
M &::=& K\,V \;\mid\; V\,V\,K\\
V &::=& x \;\mid\; \lambda x.\lambda k.M\\
K &::=& k \;\mid\; \lambda x.M
\end{array}$$

with the following reduction rules (performed at top-level, without 
any reduction under context). 

$$\begin{array}{rcl}
(\lambda x.\lambda k.M)\,V\,K &\longrightarrow& M[x := V][k := K]\\
(\lambda x.M)\,V &\longrightarrow& M[x := V]
\end{array}$$
In order to prove Turing completeness it suffices to show how to 
encode $\lambda_{cps}$ terms and reductions in our calculus. For the sake 
of readability, we use mutually recursive types (rather than their 
encoding in $\mu$-types), we use records (though pairs would have 
sufficed), we write just the recursion variable X for the filter x -> 
Xx, and use $\neg t$ to denote the type any\t. Term productions are 
encoded by the recursive types given at the beginning of Figure 5. 

Next we define the evaluation filter Eval. In its body it calls the 
filter Subst[c,F] which implements the capture free substitution 
and, when c denotes a constant, is defined as in Figure 5. 

It is straightforward to see that the definitions in Figure 5 implement 
the reduction semantics of CPS terms. Of course the definition 
above would not pass our termination condition. Indeed while 
Subst would be accepted by our algorithm, Eval would fail since 
it is not possible to ensure that the recursive calls of Eval will 
receive from Subst subtrees of the original input type. This is 
expected: while substitutions always terminate they may return trees 
that are not subtrees of the original term. 
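
To make the reduction semantics concrete, the two filters can be mimicked 
by a small Python evaluator. The tuple encoding of terms, the `app` tag 
for applications, the function names and the `fuel` bound are assumptions 
of this sketch; like Subst[c,F], the substitution below simply stops at a 
binder that rebinds the substituted name and performs no renaming. 

```python
def subst(term, c, F):
    """Replace the variable named c by F in term, as Subst[c,F] does."""
    tag = term[0]
    if tag == "var":
        return F if term[1] == c else term
    if tag == "lambda2":
        _, x, k, m = term
        if x != c and k != c:
            return ("lambda2", x, k, subst(m, c, F))
        return term                      # c is rebound: leave the term as is
    if tag == "lambda1":
        _, x, m = term
        return ("lambda1", x, subst(m, c, F)) if x != c else term
    if tag == "app":                     # K V or V V K
        return ("app",) + tuple(subst(t, c, F) for t in term[1:])
    raise ValueError(f"unknown term: {tag}")

def eval_cps(term, fuel=1000):
    """Apply the two top-level reduction rules until none applies."""
    for _ in range(fuel):
        if term[0] == "app" and len(term) == 4 and term[1][0] == "lambda2":
            _, (_, x, k, m), v, h = term
            term = subst(subst(m, x, v), k, h)   # M[x := V][k := K]
        elif term[0] == "app" and len(term) == 3 and term[1][0] == "lambda1":
            _, (_, x, m), v = term
            term = subst(m, x, v)                # M[x := V]
        else:
            return term                          # no rule applies
    raise RuntimeError("evaluation did not finish within the fuel bound")

# (lambda x. lambda k. k x) y (lambda z. halt z)
identity = ("lambda2", "x", "k", ("app", ("var", "k"), ("var", "x")))
cont = ("lambda1", "z", ("app", ("var", "halt"), ("var", "z")))
result = eval_cps(("app", identity, ("var", "y"), cont))
```

The example reduces in two steps: the first rule yields the continuation 
applied to `y`, the second yields `halt y`, on which no rule applies. The 
intermediate terms are built by substitution and are not subtrees of the 
initial term, which is the reason given above for Eval not passing the 
termination check. 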

C. Proof of subject reduction (Theorem 10) 

We first give the proof of Lemma 9, which we restate: 

Let $f$ be a filter and $v$ be a value such that $v \notin \lbag f \rbag$. 
Then for every $\gamma$, $\delta$, if 
$\delta;\gamma \vdash_{\mathit{eval}} f(v) \rightsquigarrow r$, then $r \equiv \Omega$. 

Proof. By induction on the derivation of 
$\delta;\gamma \vdash_{\mathit{eval}} f(v) \rightsquigarrow r$, and 
case analysis on the last rule of the derivation: 

(expr): here, $\lbag e \rbag = \mathtt{any}$ for any expression $e$, therefore 
this rule cannot be applied (since $\forall v,\ v \in \mathtt{any}$). 

(prod): assume $v \notin \lbag (f_1,f_2) \rbag \equiv (\lbag f_1 \rbag, \lbag f_2 \rbag)$. 
Then either $v \notin (\mathtt{any},\mathtt{any})$, in which case only rule 
(error) applies and therefore $r \equiv \Omega$. Or $v \equiv (v_1,v_2)$ and 
$v_1 \notin \lbag f_1 \rbag$. Then by induction hypothesis on the first 
premise, $\delta;\gamma \vdash_{\mathit{eval}} f_1(v_1) \rightsquigarrow r_1$ 
and $r_1 \equiv \Omega$, which contradicts the side condition 
$r_1 \neq \Omega$ (and similarly for the second premise). Therefore this rule 
cannot be applied to evaluate $(f_1,f_2)$. 

In the definition of Subst in Figure 5 we were a little bit sloppy in the 
notation and used a filter parameter as a pattern. Strictly speaking this 
is not allowed in filters and, for instance, the branch 
{ var : c } -> F in Subst should rather be written as 
{ var : y } -> ( (y=c) ; (true -> F | false -> { var : y }) ). 
Similarly for the other two cases. 

(patt): Similarly to the previous case. If 
$v \notin \lbag p \Rightarrow f \rbag = \lbag p \rbag \,\&\, \lbag f \rbag$ 
then either $v \notin \lbag p \rbag$ (which contradicts $v/p \neq \Omega$) or 
$v \notin \lbag f \rbag$ and by induction hypothesis, 
$\delta;\gamma,v/p \vdash_{\mathit{eval}} f(v) \rightsquigarrow \Omega$. 

(comp): If $v \notin \lbag f_1;f_2 \rbag \equiv \lbag f_1 \rbag$, then by 
induction hypothesis on the first premise, $r_1 \equiv \Omega$, which 
contradicts the side condition $r_1 \neq \Omega$. Therefore only rule (error) 
can be applied here. 

(recd): Similar to product type: either $v$ is not a record and therefore 
only the rule (error) can be applied, or one of the $r_i \equiv \Omega$ 
(by induction hypothesis) which contradicts the side condition. 

(union1) and (union2): since $v \notin \lbag f_1|f_2 \rbag$, this means that 
$v \notin \lbag f_1 \rbag$ and $v \notin \lbag f_2 \rbag$. Therefore if rule 
(union1) is chosen, by induction hypothesis, $r_1 \equiv \Omega$ which 
contradicts the side condition. If rule (union2) is chosen, then by 
induction hypothesis $r_2 \equiv \Omega$ which gives $r \equiv \Omega$. 

(rec-call): trivially true, since it cannot be that $v \notin \mathtt{any}$. 

(rec): we can apply straightforwardly the induction hypothesis on 
the premise and have that $r \equiv \Omega$. 

(groupby) and (orderby): If $v \notin \lbag f \rbag$ then the only rule that 
applies is (error) which gives $r \equiv \Omega$. 

□ 

We are now equipped to prove the subject reduction theorem 
which we restate in a more general manner: 

For every $\Gamma$, $\Delta$, $M$, $\gamma$, $\delta$ such that 
$\forall x \in \mathit{dom}(\gamma)$, $x \in \mathit{dom}(\Gamma) \wedge \gamma(x) : \Gamma(x)$, 
if $\Gamma;\Delta;M \vdash_{\mathit{fil}} f(t) : s$, then for all $v : t$, 
$\delta;\gamma \vdash_{\mathit{eval}} f(v) \rightsquigarrow r$ implies $r : s$. 

The proof is by induction on the derivation of 
$\delta;\gamma \vdash_{\mathit{eval}} f(v) \rightsquigarrow r$, and by case 
analysis on the rule. Besides the basic case, the only 
rules which are not a straightforward application of induction are 
the rules union1 and union2 that must be proved simultaneously. 
Other cases being proved by a direct application of the induction 
hypothesis, we only detail the case of the product constructor. 

(expr): we suppose that the host language enjoys subject reduction, 
hence $e(v) \rightsquigarrow r : s$. 

(prod): Here, we know that $t \equiv \bigvee_{i \le \mathit{rank}(t)} (t_1^i, t_2^i)$. 
Since $v$ is a value, and $v : t$, we obtain that $v \equiv (v_1,v_2)$. Since 
$(v_1,v_2)$ is a value, $\exists i \le \mathit{rank}(t)$ such that 
$(v_1,v_2) : (t_1^i, t_2^i)$. The observation that $v$ is a value is crucial 
here, since in general given a type $t' \le t$ with 
$t' \equiv \bigvee_{j \le \mathit{rank}(t')} (t_1'^{\,j}, t_2'^{\,j})$ it is not 
true that $\exists i, j$ s.t. $(t_1'^{\,j}, t_2'^{\,j}) \le (t_1^i, t_2^i)$. 
However this property holds for singleton types and therefore for values. 
We have therefore that $v_1 : t_1^i$ and $v_2 : t_2^i$. Furthermore, since we 
suppose that a typing derivation exists and the typing rules are 
syntax directed, then 
$\Gamma;\Delta;M \vdash_{\mathit{fil}} f_1(t_1^i) : s_1^i$ must be used 
to prove our typing judgement. We can apply the induction 
hypothesis and deduce that 
$\delta;\gamma \vdash_{\mathit{eval}} f_1(v_1) \rightsquigarrow r_1$ and 
similarly for $v_2$. We have therefore that the filter evaluates to 
$(r_1,r_2)$ which has type 
$(s_1^i, s_2^i) \le \bigvee_{j \le \mathit{rank}(s)} (s_1^j, s_2^j)$, which 
proves this case. 

(union1) and (union2): both rules must be proved together. Indeed,
given a filter $f_1|f_2$ for which we have a typing derivation
for the judgement $\Gamma;\Delta;M \vdash_{fil} f_1|f_2(t) : s$, either $v : t \,\&\, \lbag f_1 \rbag$
and we can apply the induction hypothesis and therefore,
$\Gamma;\Delta;M \vdash_{fil} f_1(t_1) : s_1$ and if $\delta;\gamma \vdash_{eval} f_1(v) \leadsto r_1$
(case union1) then $r_1 : s_1$. However it might be that $v \notin t \,\&\, \lbag f_1 \rbag$.
Then by the preceding lemma we have that $\delta;\gamma \vdash_{eval} f_1(v) \leadsto \Omega$ (case
union2) and we can apply the induction hypothesis on the second
premise, which gives us $\delta;\gamma \vdash_{eval} f_2(v) \leadsto r_2$ which allows
us to conclude.



(error): this rule can never occur. Indeed, if $v : t$ and $\Gamma;\Delta;M \vdash_{fil} f(t) : s$,
that means in particular that $t \leq \lbag f \rbag$ and therefore
that all of the side conditions for the rules prod, patt and recd
hold.

D. Complexity of the typing algorithm 

Proof. For clarity we restrict ourselves to types without record 
constructors but the proof can straightforwardly be extended to
them. First remark that our types are isomorphic to alternating tree 
automata (ATA) with intersection, union and complement (such 
as those defined in ifiol [TtII ). From such an ATA t it is possible 
to compute a non-deterministic tree automaton $t'$ of size $O(2^{|t|})$.
When seeing $t'$ in our type formalism, it is a type generated by the
following grammar:

\[
\begin{array}{rcl}
T & ::= & \mu X.T \;\mid\; T \texttt{|} \mathit{const} \;\mid\; \mathit{const} \\
\mathit{const} & ::= & (t, t) \;\mid\; \mathit{atom} \\
\mathit{atom} & ::= & X \;\mid\; b
\end{array}
\]

where $b$ ranges over negations and intersections of basic types.
Intuitively each recursion variable $X$ is a state in the NTA, a finite
union of products is a set of non-deterministic transitions whose
right-hand sides are each of the products, and atom productions are
leaf states. Then it is clear that if Check($f$) holds, the algorithm
considers at most $|f| \times |t'|$ distinct cases (thanks to the
memoization set $M$). Furthermore for each rule, we may test a subtyping
problem (eg, $t \leq \lbag f \rbag$), which itself is EXPTIME, thus giving the
bound. □

E. Precise typing of groupby 

The process of inferring a precise type for groupby $f(t)$ is decomposed
in several steps. First we have to compute the set $\mathcal{D}(f)$ of
discriminant domains for the filter $f$. The idea is that these domains
are the types on which we know that $f$ may give results of different
types. Typically this corresponds to all possible branchings that
the filter $f$ can do. For instance, if $\{t_1, \ldots, t_n\}$ are pairwise disjoint
types and $f$ is the filter $t_1 \Rightarrow e_1 \mid \cdots \mid t_n \Rightarrow e_n$, then the set of
discriminant domains for the filter $f$ is $\{t_1, \ldots, t_n\}$, since the various
$s_i$ obtained from $\Gamma;\varnothing;\varnothing \vdash_{fil} f(t_i) : s_i$ may all be different,
and we want to keep track of the relation between a result type $s_i$
and the input type $t_i$ that produced it. Formally $\mathcal{D}(f)$ is defined as
follows:



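\[
\begin{array}{rcl}
\mathcal{D}(e) & = & \{\texttt{any}\} \\
\mathcal{D}((f_1, f_2)) & = & \mathcal{D}(f_1) \times \mathcal{D}(f_2) \\
\mathcal{D}(p \Rightarrow f) & = & \{\lbag p \rbag \,\&\, t \mid t \in \mathcal{D}(f)\} \\
\mathcal{D}(f_1|f_2) & = & \mathcal{D}(f_1) \cup \{t \setminus \lbag f_1 \rbag \mid t \in \mathcal{D}(f_2)\} \\
\mathcal{D}(\mu X.f) & = & \mathcal{D}(f) \\
\mathcal{D}(X a) & = & \{\texttt{any}\} \\
\mathcal{D}(f_1;f_2) & = & \mathcal{D}(f_1) \\
\mathcal{D}(o f) & = & \{\texttt{[any*]}\} \quad (o = \texttt{groupby}, \texttt{orderby})
\end{array}
\]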

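\[
\mathcal{D}(\{\ell_1{:}f_1, \ldots, \ell_n{:}f_n, ..\}) = \bigcup_{t_i \in \mathcal{D}(f_i)} \{\{\ell_1{:}t_1, \ldots, \ell_n{:}t_n, ..\}\}
\]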
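As an illustration only, the clauses for expressions, patterns, alternation and composition can be sketched in Python under strong simplifying assumptions: a type is modelled as a finite set of values (`ANY` is a toy universe standing in for any), a pattern is identified with its accepted type, and products, records, recursion and groupby/orderby are left out. The constructor names (`expr`, `pat`, `union`, `comp`) and the `accepted` function, including its clause for expressions, are hypothetical and not taken from the paper.

```python
# Toy model (assumption): a type is a frozenset over a small finite universe.
ANY = frozenset(range(-3, 4))

# Hypothetical filter constructors, encoded as tagged tuples.
def expr():
    return ("expr",)

def pat(accepted_by_pattern, f):
    return ("pat", frozenset(accepted_by_pattern), f)

def union(f1, f2):
    return ("union", f1, f2)

def comp(f1, f2):
    return ("comp", f1, f2)

def accepted(f):
    """Accepted type of a filter (assumed clauses for this fragment)."""
    kind = f[0]
    if kind == "expr":
        return ANY
    if kind == "pat":
        return f[1] & accepted(f[2])
    if kind == "union":
        return accepted(f[1]) | accepted(f[2])
    if kind == "comp":
        return accepted(f[1])
    raise ValueError(kind)

def domains(f):
    """Discriminant domains, following the clauses for e, p => f, f1|f2, f1;f2."""
    kind = f[0]
    if kind == "expr":
        return {ANY}
    if kind == "pat":
        return {f[1] & t for t in domains(f[2])}
    if kind == "union":
        first = accepted(f[1])
        return domains(f[1]) | {t - first for t in domains(f[2])}
    if kind == "comp":
        return domains(f[1])
    raise ValueError(kind)

POS = frozenset({1, 2, 3})
NEG = frozenset({-3, -2, -1})
# A filter that branches on the sign of its input.
sign_filter = union(pat(POS, expr()), pat(NEG, expr()))
```

With this model, `domains(sign_filter)` yields the two branch types `POS` and `NEG`, which is the behaviour described for a filter made of pairwise disjoint branches.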



Now in order to have a high precision in typing we want to 
compute the type of the intermediate list of the groupby on a set of 
disjoint types. For that we define a normal form of a set of types that 
given a set of types computes a new set formed of pairwise disjoint 
types whose union covers the union of all the original types. 

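\[
N(\{t_i \mid i \in S\}) = \bigcup_{\varnothing \subset I \subseteq S} \Big\{ \bigcap_{i \in I} t_i \setminus \bigcup_{j \in S \setminus I} t_j \Big\}
\]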
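The normal form can be tried out on a toy model in which a type is a finite set of values. The sketch below is not the paper's algorithm on real types: the function name `normal_form` is hypothetical, and dropping empty cells is an assumption made so that the result is a partition of the union.

```python
from itertools import combinations

def normal_form(types):
    """Split a collection of set-modelled types into pairwise disjoint cells.

    For every non-empty subset I of the inputs, the cell is the intersection
    of the types in I minus the union of the types outside I. Empty cells are
    dropped (assumption).
    """
    types = [frozenset(t) for t in types]
    indices = range(len(types))
    cells = set()
    for size in range(1, len(types) + 1):
        for chosen in combinations(indices, size):
            inside = frozenset.intersection(*(types[i] for i in chosen))
            outside = frozenset().union(*(types[j] for j in indices if j not in chosen))
            cell = inside - outside
            if cell:
                cells.add(cell)
    return cells

INT = frozenset(range(-2, 3))
POS = frozenset({1, 2})
NEG = frozenset({-2, -1})
cells = normal_form([INT, POS, NEG])
```

Here `cells` contains three disjoint types (the positives, the negatives, and zero) whose union is `INT`. The enumeration is exponential in the number of input types, which is acceptable for a sketch but would need care in a real implementation.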

Recall that we are trying to deduce a type for groupby $f(t)$, that
is for the expression groupby $f$ when applied to an argument of
type $t$. Which types shall we use as input for our filter $f$ to compute
the type of the intermediate result? Surely we want to have all the
types that are in item($t$). This however is not enough since we
would not use the information of the discriminant domains for the
filter $f$. For instance if the filter gives two different result types for
positive and negative numbers and the input is a list of integers,
we want to compute two different result types for positives and
negatives and not just to compute the result of the filter application
on generic integers (indeed item([int*]) = {int}). So the idea
is to add to the set that must be normalized all the types of the set
of discriminant domains. This however is not precise enough, since
these domains may be much larger than the item types of the input
list. For this reason we just take the part of the domain types that
intersects at least some possible value of the list. In other terms we
consider the normal form of the following set

item(t) U {(Ji A \J ti \ Oi £ item(f)} 

*,6D{/) 

The idea is then that if {ti}i^i is the normalized set of the set 
above, then the type of grouby is the type [ Vigj (/(^O; 
with the optimization that we can replace the * by a + if we know 
that the input list is not empty. 

F. Comparison with top-down tree transducers 

We first show that every top-down tree transducer with regular
look-ahead can be encoded by a filter. We use the definition of fT4ll 
for top-down tree transducers: 

Definition 20 (Top-down tree transducer with regular look-ahead).
A top-down tree transducer with regular look-ahead (TDTTR)
is a 5-tuple $\langle \Sigma, \Delta, Q, Q_d, R \rangle$ where $\Sigma$ is the input alphabet,
$\Delta$ is the output alphabet, $Q$ is a set of states, $Q_d$ the set of initial
states and $R$ a set of rules of the form:

\[
q(a(x_1, \ldots, x_n)), D \to b(q_1(x_{i_1}), \ldots, q_m(x_{i_m}))
\]

where $a \in \Sigma_n$, $b \in \Delta_m$, $q, q_1, \ldots, q_m \in Q$, $\forall j \in 1..m$, $i_j \in 1..n$
and $D$ is a mapping from $\{x_1, \ldots, x_n\}$ to $2^{T_\Sigma}$ ($T_\Sigma$ being the set of
regular tree languages over alphabet $\Sigma$). A TDTTR is deterministic
if $Q_d$ is a singleton and the left-hand sides of the rules in $R$ are
pairwise disjoint.
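To make the definition concrete, here is a minimal Python sketch of how a deterministic transducer of this kind can be run. It rests on several assumptions: the look-ahead $D$ is taken to be trivial and is ignored, trees are pairs `(label, children)`, and a right-hand side is either an output node `(label, subterms)` or a call `("call", state, child_index)`. This representation and the rule table, modelled on the height-2 example given later in this appendix, are illustrative and not part of the original definition.

```python
def run(rules, state, tree):
    """Apply a deterministic top-down tree transducer to `tree` from `state`.

    `rules` maps (state, input label, arity) to a right-hand side.
    The regular look-ahead is assumed trivial and is not checked.
    """
    label, children = tree
    rhs = rules[(state, label, len(children))]

    def build(term):
        if len(term) == 3:  # ("call", q, i): continue in state q on child i
            _, next_state, index = term
            return run(rules, next_state, children[index])
        out_label, subterms = term  # output node with its own subterms
        return (out_label, [build(s) for s in subterms])

    return build(rhs)

# Rules for monadic trees of height 2 over {a, b}: the state records which
# root label was read, so the leaf can be replaced by a copy of the input.
RULES = {
    ("q0", "a", 1): ("a", [("call", "q1", 0)]),
    ("q1", "a", 0): ("a", [("a", [])]),
    ("q1", "b", 0): ("a", [("b", [])]),
    ("q0", "b", 1): ("b", [("call", "q2", 0)]),
    ("q2", "a", 0): ("b", [("a", [])]),
    ("q2", "b", 0): ("b", [("b", [])]),
}

result = run(RULES, "q0", ("a", [("b", [])]))
```

Determinism shows up here as the fact that `rules` is a dictionary: at most one rule applies to a given state, label and arity.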

Since our filters encode programs, it only makes sense to compare
filters and deterministic TDTTRs. We first show that given such
a deterministic TDTTR we can write an equivalent filter $f$ and,
furthermore, that Check($f$) does not fail. First we recall the encoding
of ranked labeled trees into the filter data-model:

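Definition 21 (Tree encoding). Let $t \in T_\Sigma$. We define the
tree-encoding of $t$, written $\llbracket t \rrbracket$, as the value defined inductively by:

\[
\begin{array}{rcll}
\llbracket a \rrbracket & = & (\texttt{'}a, [\,]) & \forall a \in \Sigma_0 \\
\llbracket a(t_1, \ldots, t_n) \rrbracket & = & (\texttt{'}a, [\, \llbracket t_1 \rrbracket \;\ldots\; \llbracket t_n \rrbracket \,]) & \forall a \in \Sigma_n
\end{array}
\]

where the list notation $[\, v_1 \ldots v_n \,]$ is a short-hand for
$(v_1, (\ldots, (v_n, \texttt{'nil})))$. We generalize this encoding to tree
languages and types. Let $S \subseteq T_\Sigma$, we write $\llbracket S \rrbracket$ the set of values
such that $\forall t \in S$, $\llbracket t \rrbracket \in \llbracket S \rrbracket$.

In particular it is clear that when $S$ is regular, $\llbracket S \rrbracket$ is a type.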
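The encoding can be sketched in Python as follows, assuming that a ranked tree is given as a pair `(label, children)`, that pairs are modelled by 2-tuples, and that the atom 'nil is modelled by the string `"nil"`. The function names are hypothetical.

```python
NIL = "nil"  # stands for the atom 'nil (assumption of this sketch)

def encode_list(values):
    """Encode [v1 ... vn] as the nested pairs (v1, (..., (vn, 'nil)))."""
    encoded = NIL
    for value in reversed(values):
        encoded = (value, encoded)
    return encoded

def encode_tree(tree):
    """Encode a(t1, ..., tn) as ('a, [ enc(t1) ... enc(tn) ])."""
    label, children = tree
    return (label, encode_list([encode_tree(child) for child in children]))

leaf = encode_tree(("a", []))
node = encode_tree(("a", [("b", []), ("c", [])]))
```

A leaf is encoded with an empty list, that is, `("a", "nil")`, matching the first clause of the definition.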

Lemma 22 (TDTTR filters). Let $T = \langle \Sigma, \Delta, Q, \{q_0\}, R \rangle$
be a deterministic TDTTR. There exists a filter $f_T$ such that:

\[
\forall t \in dom(T),\; \llbracket T(t) \rrbracket = f_T \llbracket t \rrbracket
\]

Proof. The encoding is as follows. For every state $q_i \in Q$, we
will introduce a recursion variable $X_i$. Formally, the translation is
performed by the function $TR : 2^Q \times Q \to \mathit{Filters}$ defined as:

\[
\begin{array}{rcll}
TR(S, q_i) & = & x \Rightarrow X_i\, x & \text{if } q_i \in S \\
TR(S, q_i) & = & \mu X_i.(f_1 \mid \ldots \mid f_n) & \text{if } q_i \notin S
\end{array}
\]

where every rule $r_j \in R$

\[
r_j \equiv q_i(a_j(x_1, \ldots, x_n)), D_j \to b_j(q_{j_1}(x_{j_{k_1}}), \ldots, q_{j_m}(x_{j_{k_m}}))
\]

is translated into:

\[
f_j = (\texttt{'}a_j, [\, x_1 \,\&\, \llbracket D_j(x_1) \rrbracket \;\ldots\; x_n \,\&\, \llbracket D_j(x_n) \rrbracket \,]) \Rightarrow
\begin{cases}
(\texttt{'}b_j, [\,]) & \text{if } b_j \in \Delta_0 \\
(\texttt{'}b_j, (x_{j_{k_1}}; TR(S', q_{j_1}), (\ldots, (x_{j_{k_m}}; TR(S', q_{j_m}), \texttt{'nil})))) & \text{where } S' = S \cup \{q_i\}, \text{ otherwise}
\end{cases}
\]

The fact that $\forall t \in dom(T)$, $\llbracket T(t) \rrbracket = TR(\varnothing, q_0)\llbracket t \rrbracket$ is proved by a
straightforward induction on $t$. The only important point to remark
is that since $T$ is deterministic, there is exactly one branch of the
alternate filter $f_1 \mid \ldots \mid f_n$ that can be selected by pattern matching
for a given input $v$. As for why Check($f_T$) holds, it is sufficient
to remark that each recursive call is made on a strict subtree of the
input, which guarantees that Check($f_T$) returns true. □

Lemma 23 (TDTTR $\subsetneq$ filters). Filters are strictly more expressive
than TDTTRs.

Proof. Even if we restrict filters to have the same domain as TDTTRs
(meaning, we use a fixed input alphabet $\Sigma$ for atomic types)
we can define a filter that cannot be expressed by a TDTTR. For
instance consider the filter:

\[
y \Rightarrow \left(\mu X.\;
\begin{array}{lcl}
 (\texttt{'a}, \texttt{'nil}) & \Rightarrow & y \\
 \mid\; (\texttt{'b}, \texttt{'nil}) & \Rightarrow & y \\
 \mid\; (\texttt{'a}, [x]) & \Rightarrow & (\texttt{'a}, [\, X x \,]) \\
 \mid\; (\texttt{'b}, [x]) & \Rightarrow & (\texttt{'b}, [\, X x \,])
\end{array}
\right)
\]

This filter works on monadic trees of the form $u_1(\ldots u_{n-1}(u_n) \ldots)$
where $u_i \in \{a, b\}$, and essentially replaces the leaf $u_n$ of a tree
$t$ by a copy of $t$ itself. This cannot be done by a TDTTR. Indeed,
TDTTRs have only two ways to "remember" a subtree and copy it.
One is of course by using variables; but their scope is restricted to a
rule and therefore an arbitrary subtree can only be copied at a fixed
distance from its original position. For instance in a rule of the form
$q(a(x)), D \to a(q_1(x), b(b(q_1(x))))$, assuming that $q_1$ copies its
input, the copy of the original subtree is two levels down from its
next sibling but it cannot be arbitrarily far. A second way to copy
a subtree is to remember it using the states. Indeed, states can encode
the knowledge that the TDTTR has accumulated along a path.
However, since the number of states is finite, the only thing that a
TDTTR can do is copy a fixed path. For instance for any given $n$,
there exists a TDTTR that performs the transformation above, for
trees of height $n$ (it has essentially $2^n - 1$ states which remember
every possible path taken). For instance for $n = 2$, the TDTTR is:

\[
\begin{array}{lcl}
q_0(a(x)), D & \to & a(q_1(x)) \\
q_1(a()), D & \to & a(a) \\
q_1(b()), D & \to & a(b) \\
q_0(b(x)), D & \to & b(q_2(x)) \\
q_2(a()), D & \to & b(a) \\
q_2(b()), D & \to & b(b)
\end{array}
\]

where $\Sigma = \Delta = \{a, b\}$, $Q = \{q_0, q_1, q_2\}$, $Q_d = \{q_0\}$ and
$D = \{x \mapsto T_\Sigma\}$. It is however impossible to write a TDTTR that
replaces a leaf with a copy of the whole input, for inputs of arbitrary
size. A similar example is used in O to show that TDTTR and 
bottom-up tree transducers are not comparable. □ 
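The filter used in the proof of Lemma 23 can be mimicked by a short Python function, given here only as an illustration. It assumes monadic trees represented as pairs `(label, children)` with at most one child. The outer variable `whole` plays the role of $y$ and the inner function `go` plays the role of the recursion variable $X$; both names are hypothetical.

```python
def replace_leaf_with_copy(tree):
    """Replace the leaf of a monadic tree by a copy of the whole tree."""
    whole = tree  # corresponds to the variable y bound to the entire input

    def go(node):  # corresponds to the recursive filter X
        label, children = node
        if not children:  # leaf case: ('a,'nil) or ('b,'nil) => y
            return whole
        (child,) = children  # monadic tree: exactly one child
        return (label, [go(child)])

    return go(tree)

doubled = replace_leaf_with_copy(("a", [("b", [])]))
```

The point of the lemma is visible in the closure: `whole` stays in scope at any depth, whereas a transducer rule can only refer to the subtrees bound by its own left-hand side.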

G. Operators on record types. 

We use the theory of records defined for CDuce. We summarize 
here the main definitions. These are adapted from those given in 
Chapter 9 of Alain Frisch's PhD thesis ifol where the interested 
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reader can also find detailed definitions of the semantic interpre- 
tation of record types and of the subtyping relation it induces, the 
modifications that must be done to the algorithm to decide it, finer 
details on pattern matching definition and compilation techniques 
for record types and expressions. 

Let $Z$ denote some set, a function $r : \mathcal{L} \to Z$ is quasi-constant
if there exists $z \in Z$ such that the set $\{\ell \in \mathcal{L} \mid r(\ell) \neq z\}$ is finite; in
this case we denote this set by dom($r$) and the element $z$ by def($r$).
We use $\mathcal{L} \rightharpoonup Z$ to denote the set of quasi-constant functions from
$\mathcal{L}$ to $Z$ and the notation $\{\ell_1 = z_1, \ldots, \ell_n = z_n, \_ = z\}$ to denote
the quasi-constant function $r : \mathcal{L} \rightharpoonup Z$ defined by $r(\ell_i) = z_i$ for
$i = 1..n$ and $r(\ell) = z$ for $\ell \in \mathcal{L} \setminus \{\ell_1, \ldots, \ell_n\}$. Although this
notation is not univocal (unless we require $z_i \neq z$), this is largely
sufficient for the purposes of this section.

Let $\bot$ be a distinguished constant, then the sets $\texttt{string} \rightharpoonup
\texttt{Types} \cup \{\bot\}$ and $\texttt{string} \rightharpoonup \texttt{Values} \cup \{\bot\}$ denote the set of
all record type expressions and of all record values, respectively.
The constant $\bot$ represents the value of the fields of a record that
are "undefined". To ease the presentation we use the same notation
both for a constant and the singleton type that contains it: so when
$\bot$ occurs in $\mathcal{L} \rightharpoonup \texttt{Values} \cup \{\bot\}$ it denotes a value, while in
$\texttt{string} \rightharpoonup \texttt{Types} \cup \{\bot\}$ it denotes the singleton type that contains
only the value $\bot$.

Given the definitions above, it is clear that the record types
in Definition 2 are nothing but specific notations for some
quasi-constant functions in $\texttt{string} \rightharpoonup \texttt{Types} \cup \{\bot\}$. More precisely,
the open record type expression $\{\ell_1{:}t_1, \ldots, \ell_n{:}t_n, ..\}$ denotes the
quasi-constant function $\{\ell_1 = t_1, \ldots, \ell_n = t_n, \_ = \texttt{any}\}$ while
the closed record type expression $\{\ell_1{:}t_1, \ldots, \ell_n{:}t_n\}$ denotes the
quasi-constant function $\{\ell_1 = t_1, \ldots, \ell_n = t_n, \_ = \bot\}$. Similarly,
the optional field notation $\{\ldots, \ell?{:}t, \ldots\}$ denotes the record
type expressions in which $\ell$ is mapped either to $\bot$ or to the type $t$,
that is, $\{\ldots, \ell = t|\bot, \ldots\}$.

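Let $t$ be a type and $r_1$, $r_2$ two record type expressions, that is
$r_1, r_2 : \texttt{string} \rightharpoonup \texttt{Types} \cup \{\bot\}$. The merge of $r_1$ and $r_2$ with
respect to $t$, noted $\oplus_t$ and used infix, is the record type expression
defined as follows:

\[
(r_1 \oplus_t r_2)(\ell) \stackrel{\text{def}}{=}
\begin{cases}
r_1(\ell) & \text{if } r_1(\ell) \,\&\, t \leq \texttt{empty} \\
(r_1(\ell) \setminus t) \mid r_2(\ell) & \text{otherwise}
\end{cases}
\]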
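The merge operator can be experimented with on a toy model, sketched below under explicit assumptions: a type is a finite set of tags, the constant $\bot$ is the tag `"undefined"`, `ANY` contains only the defined tags, and a quasi-constant function is a dictionary of explicit fields plus a default. The class `RecordType` and the helper `concat` are hypothetical names; `concat` follows the appendix's definition of concatenation as $t_1 + t_2 = t_2 \oplus_\bot t_1$.

```python
UNDEF = frozenset({"undefined"})  # singleton type of the constant ⊥ (toy model)
INT = frozenset({"int"})
BOOL = frozenset({"bool"})
ANY = INT | BOOL                  # toy `any`: every defined value

class RecordType:
    """Quasi-constant function: explicit fields plus a default for the rest."""

    def __init__(self, fields, default):
        self.fields = dict(fields)
        self.default = default

    def __call__(self, label):
        return self.fields.get(label, self.default)

def merge(r1, r2, t):
    """Field-wise merge of r1 and r2 with respect to the type t."""
    def field(a, b):
        # Keep r1's field when it cannot be in t; otherwise remove t and add r2's.
        return a if not (a & t) else (a - t) | b

    labels = set(r1.fields) | set(r2.fields)
    return RecordType({l: field(r1(l), r2(l)) for l in labels},
                      field(r1.default, r2.default))

def concat(r1, r2):
    """Record concatenation r1 + r2, defined as the merge of r2 and r1 w.r.t. ⊥."""
    return merge(r2, r1, UNDEF)

closed = RecordType({"a": INT, "b": INT}, UNDEF)     # {a:int, b:int}
optional_a = RecordType({"a": BOOL | UNDEF}, UNDEF)  # {a?:bool}
open_record = RecordType({}, ANY)                    # {..}
both = concat(closed, optional_a)
```

In this model `both("a")` is `INT | BOOL` and `both("b")` is `INT`, in line with the example {a:int, b:int} + {a?:bool} = {a:int|bool, b:int} discussed in this appendix.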

Recall that by Lemma [TTI a record type (ie, a subtype of {..}) is 
equivalent to a finite union of record type expressions (ie,
quasi-constant functions in $\texttt{string} \rightharpoonup \texttt{Types} \cup \{\bot\}$). So the definition
of merge can be easily extended to all record types as follows

Finally, all the operators we used for the typing of records in the
rules of Section 4.3 are defined in terms of the merge operator:

\[
\begin{array}{rcll}
t_1 + t_2 & = & t_2 \oplus_{\bot} t_1 & (1) \\
t \setminus \ell & = & \{\ell = \bot, \_ = c_0\} \oplus_{c_0} t & (2)
\end{array}
\]

where $c_0$ is any constant different from $\bot$ (the semantics of the
operator does not depend on the choice of $c_0$ as long as it is different
from $\bot$).

Notice in particular that the result of the concatenation of two
record type expressions $r_1 + r_2$ may result for each field $\ell$ in three
different outcomes:

1. if $r_2(\ell)$ does not contain $\bot$ (ie, the field $\ell$ is surely defined), then
we take the corresponding field of $r_2$: $(r_1 + r_2)(\ell) = r_2(\ell)$

2. if $r_2(\ell)$ is undefined (ie, $r_2(\ell) = \bot$), then we take the
corresponding field of $r_1$: $(r_1 + r_2)(\ell) = r_1(\ell)$

3. if $r_2(\ell)$ may be undefined (ie, $r_2(\ell) = t|\bot$ for some type $t$),
then we take the union of the two corresponding fields since
it can result either in $r_1(\ell)$ or $r_2(\ell)$ according to whether the
record typed by $r_2$ is undefined in $\ell$ or not: $(r_1 + r_2)(\ell) =
r_1(\ell) \mid (r_2(\ell) \setminus \bot)$.

This explains all the examples we gave in the main text. In particular,
{a:int, b:int} + {a?:bool} = {a:int|bool, b:int} since
"a" may be undefined in the right hand-side record while "b" is
undefined in it, and {a:int} + {..} = {..}, since "a" in the right
hand-side record is defined (with "a" mapped to any) and therefore has
priority over the corresponding definition in the left hand-side record.

H. Encoding of Co-grouping

As shown in Section 5.2, our groupby operator can encode JaQL's
group each x = e as y by e', where e computes the grouping key,
and for each distinct key, e' is evaluated in the environment where
x is bound to the key and y is bound to the sequence of elements in
the input having that key. Co-grouping is expressed by:

    group
      l_1 by x = e_1 as y_1
      ...
      l_n by x = e_n as y_n
    into e

Co-grouping is encoded by the following composition of filters:

    [ ⟦l_1⟧ ... ⟦l_n⟧ ];
    [ Transform[x ⇒ (1, x)] ... Transform[x ⇒ (n, x)] ];
    Expand;
    groupby ((1,$) ⇒ ⟦e_1⟧ | ... | (n,$) ⇒ ⟦e_n⟧);
    Transform[ (x,l) ⇒ ( [ (l;Rgrp1) ... (l;Rgrpn) ] ;

where

    let filter Rgrpi = 'nil => 'nil
                     | ((i,x),tail) => (x, Rgrpi tail)
                     | (_,tail) => Rgrpi tail

Essentially, the co-grouping encoding takes as argument a sequence
of sequences of values (the n sequences to co-group). Each of
these n sequences is tagged with an integer i. Then we flatten this
sequence of tagged values. We can then apply our groupby operator
on this single sequence, modifying each key selector so that it is
only applied to a value tagged with integer i. Once this is done, we
obtain a sequence of pairs (k, l) where k is the common grouping
key and l a sequence of tagged values. We only have to apply
the auxiliary Rgrp filter which extracts the n subsequences from
l (tagged with 1..n) and removes the integer tag for each value.
Lastly we can call the recombining expression e which has in scope
x bound to the current grouping key and y_1, ..., y_n bound to each of
the sequences of values from input l_1, ..., l_n whose grouping key
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